(57) ABSTRACT

This invention is a software-based method for facilitating
the recovery of a RAID storage system from the simultaneous
failure of two or more disks (catastrophic failure). It
involves the identification of the logical address and length
of the failed areas of the failed disks and the writing of this
information into a bad region table which is replicated on
each disk. This makes it easier and less expensive to identify
the problem areas and make the necessary repairs.
MULTIPLE-CHANNEL FAILURE DETECTION IN RAID SYSTEMS

CROSS-REFERENCE TO RELATED APPLICATIONS

Not Applicable

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not Applicable.

BACKGROUND OF THE INVENTION

(1) FIELD OF THE INVENTION

This invention relates to RAID systems in which
multiple-channel failure is detected and the diagnostic
information recorded.

(2) DESCRIPTION OF RELATED ART INCLUDING
INFORMATION DISCLOSED UNDER 37 CFR 1.97 AND 1.98

The acronym RAID refers to systems which combine disk
drives for the storage of large amounts of data. In RAID
systems the data is recorded by dividing each disk into
stripes, while the data are interleaved so the combined
storage space consists of stripes from each disk. RAID
systems fall under 5 different architectures, plus one
additional type, RAID-0, which is simply an array of disks and
does not offer any fault tolerance. RAID 1-5 systems use
various combinations of redundancy, spare disks, and parity
analysis to achieve conservation reading and writing of data
in the face of one and, in some cases, multiple intermittent
or permanent disk failures. Ridge, P. M., The Book of SCSI:
A Guide for Adventurers, [...] 1995, p. [...].

It is important to note that multiple disk failures
(catastrophic failure) are not supposed to occur in RAID
systems. Such systems are designed so disk failures are
independent and the possibility that a second disk will fail
before the data on a first failed disk can be reconstructed will
be minimal. In order to shorten this susceptible period of
"degraded" operation, a spare disk is frequently provided so
the reconstruction of the failed disk can begin as soon as a
failure is detected. Nevertheless, multiple disk failures do
occur for a number of more or less unlikely reasons, such as
a nearby lightning strike causing a power surge, or a physical
tremor shaking the disks and disrupting the read/write heads
over multiple disks. Such events can create logically invalid
regions. This invention is equally useful for identifying
logically invalid regions of disks whether or not the region in
question is also physically bad.

Multiple disk failures may be classified in two categories:

A. Local or B. transient failures. Such failures stem from
medium errors, localized hardware errors such as corruption
of track data, and bus errors. Type A and B errors are handled
by retries. The retries are made automatically; the number of
retries depends on the number of disks in the array and the
demands on the system, including the errors detected in the
other disks of the array.

C. Burst or severe errors. Such errors are seen over a large
range of addresses or cause the disk to become inaccessible
after an attempt is made to access a certain region. Type C
errors are handled by failing a disk with powering down of
the entire system. Type C errors are also referred to as
"catastrophic" errors.

A system which is downed by a type C error is restored
by the following steps. 1. The system is repowered. 2. An
attempt is made to restore the failed disk through
redundancy. 3. The failed disks are replaced and reconstructed.

Other classifications of failures have been proposed, for
example, the following: 1. Transient failures. Unpredictable
behavior of a disk for a short time. 2. Bad sector. A portion
of a disk cannot be read, often for physical reasons.
3. Controller failure. The disk contents are unaffected, but
because of controller failure, the disk cannot be read. 4. Disk
failure. The entire disk becomes unreadable, generally due
to hardware faults such as a disk head crash. Pankaj Jalote,
Fault Tolerance in Distributed Systems, Prentice Hall,
Englewood Cliffs, N.J., 1994, pages 100-101.

Disk arrays which allow writeback-caching are subject to
the danger of losing data which have been accepted from the
host [...] redundancy and no error
protection. RAID 1-5 systems provide error correction for
the loss of a single channel through parity methods.
Error-detection in the event of multiple channel failure, however,
cannot be [...]. Under these circumstances, data may
be correctly written on some channels but not on others, a
falsely valid parity might be returned, and corrupted data
may be returned. If the unit must be powered down to correct
the situation before the array can be brought back online,
there may be no opportunity to rewrite the data successfully
and live write-back data may also go unwritten.

Faulty cache memory may produce apparent multiple-disk
errors of a persistent nature. For example, cache data
with incorrect parity may generate bad SCSI parity on both
the data channel and on the parity channel. In this case, when
a write to disk is performed, two disks will report that the
data is invalid.

The sharing of one bus by multiple disks, which is commonly
done on RAID systems, creates a single point of
failure in the bus which increases the probability of "two
channel" failure. For example, in an array of five channels
(four data channels and one parity channel) with each
channel serving five disks, the failure of a single bus means
that an error on any one of the 20 disks on the four other
channels will be unrecoverable. This has the same effect
as a two channel failure.

In the present invention, a table of address ranges which
have not been successfully written to a parity stripe is
replicated on one disk on each channel in the array with
frequent updating. After a catastrophic failure of multiple
disks, assuming at least one of those disks can be written to,
there will be a record of the failure on some disk. Since the
record is on many disks, rather than only on the disk which
experienced the failure, the controller can generate a list of
all regions where data have been lost after the array has been
repaired, even if the unit must be powered down before such
a repair can be performed. This reduces the loss of down
time for the system and reduces the cost of restoring the
system.

The RAID Advisory Board has provided a summary of
criteria for the classification of RAID systems with respect
to reliability. http://www.raid-advisory.com/EDAPDef.html.
It is expected that the present invention will
be useful in the development of "Failure Tolerant Disk
Systems" (FTDS) and "Disaster Tolerant Disk Systems"
(DTDS).

U.S. Pat. No. 4,598,357 discloses a system in which data
involved in a writeback error are reassigned to an unused
portion of a working disk. The location of areas from which
data have been lost are not recorded.

U.S. Pat. No. 4,945,535 discloses an address control
device which, when it detects an error in a data word read
from a main memory device, changes the address of that
error and does not use the memory area in subsequent data
writes.

U.S. Pat. No. 5,166,936 discloses a method for automatically
remapping a disk by removing a bad sector and
replacing it with a good track of data. A flag is set during the
process so that should power fail the process can be
restarted.

U.S. Pat. No. 5,249,288 discloses an electronic printing
system which identifies physically bad areas and remaps
them through file allocation.

U.S. Pat. No. 5,271,012 discloses a RAID system tolerant
to failure of two disks which uses the double generation of
parity information using alternate rows and diagonals of
direct access storage devices.

U.S. Pat. No. 5,274,799 discloses a RAID 5 system in
which the copyback cache storage unit is used to store peak
load data and completes the write function during relatively
quiescent periods.

U.S. Pat. No. 5,285,451 discloses a mass memory system
capable of tolerating two failed drives in which a number of
disk drives are coupled to an equal number of buffers by
X-bar switches. The switches couple and decouple functional
and nonfunctional drives as necessary.

U.S. Pat. No. 5,412,661 discloses a data storage system in
which disks are arrayed and each disk is controlled by two
disk controllers. The system is tolerant of the failure of any
one controller and has hot spare disks to accommodate disk
failure.

U.S. Pat. No. 5,463,765 discloses a process in which
invalid blocks of data are stored in a new location and used
to recover the data of the faulty drive.

U.S. Pat. No. 5,479,611 discloses an error-correction
technique in which data from a bad block on a disk are
reassigned and reconstructed without the use of a cache
memory.

U.S. Pat. No. 5,469,453 discloses a mass data storage
apparatus in which bad blocks are time stamped and given
a logical address. Comparison of the addresses and time
stamps allows determination of failures of the writing
devices.

U.S. Pat. No. 5,526,482 discloses a fault-tolerant storage
device array in which at least two redundant copies of each
pending data block are retained in the array controller's
buffer memory and the copyback cache storage unit,
providing protection against buffer failure.

U.S. Pat. No. 5,548,711 discloses a system including a
DATA-RAM and a SHADOW-RAM. Write data from the
CPU is stored in two independent memories to insure that
pending Write data are not lost.

U.S. Pat. No. 5,564,011 discloses a non-RAID system in
which critical data is replicated and used to regenerate failed
control blocks.

U.S. Pat. No. 5,572,659 discloses an adapter for mirroring
information on two channels which detects the failure of one
channel and reads and writes from the other channel.

U.S. Pat. No. 5,574,856 discloses a storage device array
in which data blocks of converted data are labeled with
predetermined code bits which indicate the operation in
which a fault occurs. In the presence of a fault, a data
reconstruction operation and a data reassignment operation
are indicated.

U.S. Pat. No. 5,574,882 discloses a system for identifying
inconsistent parity in an array of storage in which a bit map
of inconsistent parity groups is created.

U.S. Pat. No. 5,600,783 discloses a disc array system in
which data for a faulty disc is stored in a cache until the disc
is replaced.

U.S. Pat. No. 5,617,425 discloses an array supporting
system in which drive controllers accept responsibility from
the array controller for detecting write errors and reallocating
data away from faulty discs.

U.S. Pat. No. 5,636,359 discloses a performance enhancement
system which uses a directory means to prevent errors
in the reading and writing of data.

U.S. Pat. No. 5,644,697 discloses a redundant array of
disks in which the disks are divided into areas of varying
size and having a single status table which indicates which
areas are in use.

U.S. Pat. No. 5,657,439 discloses a system in which a
logical region of a disk is used as a distributed spare for use
in recovering data having errors.

Those prior art RAID systems tolerant to multiple disk
failure exceeding the redundancy of the array depend on
hardware, such as non-volatile memory or cache memory
with a battery or extra disks, to cope with writeback cache
loss in the event of multiple disk failure. The present
invention uses only software and a small portion of reserved
space on each disk to provide a reliable, inexpensive, widely
applicable system for error-detection for write-back data lost
during a catastrophic multiple disk failure.

BRIEF SUMMARY OF THE INVENTION

Catastrophic disk-array failures involve the failure of
greater than one disk in a RAID 1-5 system, or any disk in
a RAID 0 system. In almost all cases, however, there
remains the ability of the controller to communicate with at
least one disk in the array. The present invention uses
software and a small portion of each disk in the array to write
a bad area table on each disk. The bad area table provides the
logical address and length of the area in the array's logical
space which has been corrupted by physical damage on the
media or other causes of write failure. After a catastrophic
failure of multiple disks, assuming at least one disk can be
written, there will be a record of the failure on at least one
disk. The record is on many or at least one disk,
rather than only on the disk which experienced the failure.
The process of repairing the array is greatly simplified because
all bad regions of the array can be easily identified. This
reduces the loss of down time for the system and reduces the
cost of restoring the system.

The process of recording on one or more
disks, as described in this patent, has the advantage of very
rapidly recording the failure incident. It takes only about 20
milliseconds to record to a disk. Recording to flash memory
can require a significantly longer time, up to several seconds
in the worst case. The difference in recording time may be
crucial under certain failure conditions, for example, in the
case of a power failure recording to disk could be accomplished
while recording to flash memory would fail.

This invention is a process for designating physically or
logically invalid regions of storage units as a whole or
fractional number of blocks on storage units on which data
has been striped, in a fault-tolerant storage device array of
a number of failure independent storage units which receive
information from a writeback cache and a controller with a
writeback cache. First, the logical address and length of the
physically or logically invalid region is determined. Second,
the address and length are written on a bad region table, and
thirdly, the bad region table is replicated on each storage
unit. The process may be used with storage units which are
disks, tapes, or CDs which are connected to the controller.
The process may be used when the bad region is due to a
writeback, a read, or a write error, and in a system in which
a number of storage units exceeding the redundancy have
failed, or when the data is being restored or replaced on a
spare storage unit or in a non-redundantly configured array.
In addition to being replicated on two or more storage units,
the bad region table may also be replicated in volatile
memory with battery backup in the controller, or on additional
storage units separate from the array of storage units.
Finally, the process can include the steps of time-stamping
entries and determining the most recent entry.

This invention is especially useful in connection with a
host computer with a RAID system which is periodically
backed up to tape and which participates in a distributed
system through a network. In the absence of this invention,
corrupted data could be sent to the host and then propagated
through the network to other nodes in the distributed system.
With our invention, the data would be recognized as invalid
or lost by the host system, and there would exist no danger
of corrupted data leaving the local node or being used for
processing by the local node. The backup tape would then be
used to roll the node back to an earlier, consistent
configuration.

This invention is also especially useful in a system with
two RAID arrays software-mirrored by the host computer's
operating system. After the occurrence of a catastrophic
system crash, the host could reassemble all data which is
correct on either of the two arrays, using the present
invention. Without this invention, the host could not identify
which blocks were in error on which RAID device.

The objective of this invention is to provide an inexpensive
means for identifying the locus of catastrophic failure of
RAID 0-5 systems.

Another objective is to provide means for rapid identification
of failed areas in a RAID 0 system with no redundancy.

Another objective is to provide means for rapid identification
of failed areas in RAID 2-5 systems which utilize
parity to correct single disk failures.

Another objective is to provide means for inexpensive
catastrophic failure identification which do not require
hardware other than small areas of the storage disks.

Another objective is to provide software means widely
adaptable to a variety of configurations of RAID 0-5 systems
for identification of the sites of catastrophic failures.

A final objective is to facilitate the rapid recovery of
RAID 0-5 systems from catastrophic failure occasioned by
physical or logical sources.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a schematic of the external view of the array,
disk array controller, and host computer.

FIG. 2 is a schematic of the system architecture showing
only one channel.

FIG. 3 is a schematic of the DMA sync hardware.

FIG. 4 is a flowchart of the portion of the system boot
process relevant to the present invention.

FIGS. 5a and 5b are flowcharts of the routine which is
invoked periodically to update the bad block tables on the
disks.

FIG. 6 is a flowchart of the routine which is invoked prior
to a data operation on the disk array.

FIG. 7 is a flowchart of the subroutine which checks the
validity of data loaded during a blocking/deblocking operation
to perform read-modify-write to the disks.

FIG. 8 is a flowchart of the routine which is invoked when
an operation to the disks completes with bad status.

FIG. 9 is a flowchart of the subroutine which checks a
given range of logical blocks for [...] the bad block table.

FIG. 10 is a flowchart of the subroutine which adds a
range of logical blocks to the bad block table.

FIG. 11 is a flowchart of the subroutine which deletes a
range of logical blocks from the bad block table.

FIG. 12 is a flowchart of the process performed when a
storage unit reports a medium error status on a read.

FIG. 13 is a flowchart of the change range subroutine.

FIGS. 14a and 14b are flowcharts of the subroutine called
when a region of the array is determined to be invalid.

FIGS. 15a and 15b are flowcharts of the subroutine called
when a write operation successfully commits data to the
storage array.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 is a schematic of the external view of a RAID 3
system comprising a single host computer, a RAID
controller, and two tiers of 5 Direct Access Storage Device
(DASD) units with two parity DASDs and two additional
hot-spare DASDs which incorporates the invention. All the
DASDs in a system taken as a whole is referred to as an
"array" of DASDs. [...]
DASDs. A DASD may be uniquely identified by a
channel number and a tier letter, for example DASD 1A is
the [...] connected to channel 1 of the controller.

A preferred controller is the Z-9100 Ultra-Wide SCSI
RAID controller manufactured by Digi-Data Corporation,
Jessup, Md.

The host computer 10 is connected by the host small
computer system interface (SCSI) bus 20 to the disk array
controller 30. Disk array controller 30 is connected to DASD
1A 40 and DASD 1B 41 via the channel 1 disk SCSI data bus
70; to DASD 2A 42 and DASD 2B 43 via the channel 2 disk
SCSI data bus 71; to DASD 3A 44 and DASD 3B 45 via the
channel 3 disk SCSI data bus 72; and to DASD 4A 46 and
DASD 4B 47 via the channel 4 disk SCSI data bus 73,
respectively. Parity DASD 5A 50 and 5B 51 are connected
to the Disk Array Controller 30 by the channel 5 SCSI parity
disk bus 80. Spare DASD 6A 60 and 6B 61 are connected
to Disk Array Controller 30 by the channel 6 SCSI hot spare
disk bus 90.

Additional tiers of DASDs may be used. Additional host
channels and host computers may be used on the system.
Any suitable host computer may be used.

FIG. 2 is a schematic of the system architecture of the disk
array controller (30 in FIG. 1) showing one disk channel and
one host channel. The flow of data between host and
disk array is indicated by the heavy line. Data is received
from the host computer via the host SCSI bus 20 into the
SCSI input/output processor (SCSI IOP) 31. The SCSI IOP
initiates memory transactions to or from the cache memory
32 through the bridge chip 33 which bridges the system bus
and the cache bus. A cache bus 22 connects the bridge chip
33, cache memory 32, and the hardware control mechanism
DMA Sync 34. The DMA Sync acts as a direct memory
access (DMA) controller with the additional RAID-3 function
of parity generation and checking and replacement of
data with a hot spare. It also generates reads or writes to
specific cache addresses and translates the data between the
cache bus 22 and the SCSI interface chip 35 on the individual
channel. The DMA Sync also controls the necessary
hardware handshaking signals for direct memory access
(DMA). Although only one SCSI interface chip 35 and SCSI
disk bus 70 are shown in FIG. 2, there are as many of these
components as there are busses of DASDs. The SCSI
interface chip 35 is connected by connector 24 to bridge chip
33. The DMA Sync 34 is connected by connector 23 to the
bridge chip 33. The non-volatile flash electrically erasable
programmable read-only memory (EEPROM) 39 stores
parameters for the controller and the system firmware, which
is uncompressed from the flash into 4 Megabyte local
dynamic random-access memory (DRAM) 38 when the
system is booted. A DUART chip 37 has two RS-232
connectors 27 which allow the user to configure the unit
through a command line interface and also provide communication
between the unit and a workstation running debugging
software. The flash memory 39, local DRAM 38,
DUART chip 37 and CPU 36 and system bus 21 are
connected by a processor bus 25. Both the SCSI interface
chip 35 and the DMA Sync 34 are programmed by the CPU
36 through the system bus 21 and the bridge chip 33.

A preferred CPU 36 is an Intel 960RP available from Intel
Corporation, Santa Clara, Calif. A preferred SCSI IOP 31 is
a Symbios SYM53C875 available from Symbios Logic
Incorporated, Colorado Springs, Colo. A preferred system
bus 21 is a 32-bit bus designed in accordance with the
Peripheral Component Interconnect (PCI) specification. A
preferred SCSI interface chip 35 is a QLOGIC FAS366U
Ultra Wide SCSI interface chip available from QLogic
Corporation, Costa Mesa, Calif.

FIG. 3 is a schematic of the DMA sync hardware. The
controlling state-machine and related signals and registers,
including handshaking signals, are omitted from this schematic.
Data enter and exit the DMA sync (34 in FIG. 2) via
the cache bus 22 and the individual channel interface chips
in the SCSI Interface 35. Data enter and exit the DMA sync
(34 in FIG. 2) via the cache bus 22 and data buses 110, 120,
130, 140, 150, 160 to the individual channel interface chips
35, 92, 93, 94, 95, 96. During a write from cache to disk,
data from the cache bus is latched in two 36 bit registers 310,
312 from which a parity byte is generated by parity-generating
circuitry 320. The eight data bytes and two parity
bytes are latched in two sets of five 9 bit registers 330, 332.
The data are then rearranged to be byte-striped across the
disks and are passed to a multiplexor 340 to provide data for
the spare channel if it is to be used. For reads from disk to
cache the process is reversed.
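
The parity step described for FIG. 3 can be illustrated with a minimal sketch. The text does not state how circuitry 320 computes the parity byte; the sketch assumes the byte-wise exclusive-OR that is usual for RAID-3, and it ignores the ninth bit carried by each 9 bit register. The function names are hypothetical.

```python
from functools import reduce
from operator import xor
from typing import List, Optional


def parity_byte(data: bytes) -> int:
    """XOR of the data bytes in one word (assumed parity function)."""
    return reduce(xor, data, 0)


def add_parity(word: bytes) -> bytes:
    """Append one parity byte to four data bytes, giving five byte lanes."""
    if len(word) != 4:
        raise ValueError("expected four data bytes, one per data channel")
    return word + bytes([parity_byte(word)])


def rebuild_lane(lanes: List[Optional[int]], missing: int) -> int:
    """Recover the byte of one failed lane from the four surviving lanes."""
    return reduce(xor, (v for i, v in enumerate(lanes) if i != missing), 0)


# One 4-byte word striped over four data channels plus the parity channel.
striped = add_parity(b"\x12\x34\x56\x78")
lanes = list(striped)
lanes[2] = None                      # pretend channel 3 could not be read
recovered = rebuild_lane(lanes, 2)   # equals the lost data byte 0x56
```

With a single lane missing the lost byte is recoverable; with two lanes missing it is not, which is the situation the bad block table is meant to record.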

FIG. 4 is a diagram of the data structure used for bad
block management, herein referred to as the BBM MAP,
which is maintained in the controller's memory. The BBM
MAP consists of an array of individual records called the
BBM TABLE 402. Each record describes a region which has
been determined to be invalid. The fields contained in the
record have the following meanings. LUN refers to the SCSI
logical unit through which the host computer would access
the invalid region. START ADDRESS indicates at what
logical block address within the LUN the invalid region
starts. EXTENT indicates for how many logical blocks from
START ADDRESS the invalid region extends. The pointer
NEXT is used as a linkage field so that the elements within
the BBM TABLE 402 can be organized into linked lists.
Additionally, the BBM MAP contains BBM FIRST FREE
404, a pointer to the first unused entry in the BBM TABLE,
BBM LIST 406, a pointer to the first entry in the BBM
TABLE 402 which describes an invalid region, and BBM
DIRTY 408, a boolean value which indicates whether the
table has been altered since it was last saved to the disk
array. The unused entries in the BBM TABLE 402 are
organized into a linked list of which BBM FIRST FREE 404
is the head, and the entries describing invalid regions are
organized into a linked list of which BBM LIST 406 is the
head. In FIG. 4, the table is shown as containing two unused
entries and two used entries.
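
A minimal sketch of the BBM MAP as described above follows. It is an illustration only: list indices stand in for the NEXT pointers, the class and method names are hypothetical, and the region values are invented to reproduce the two-used, two-unused state of FIG. 4.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BBMEntry:
    lun: int = 0                # LUN
    start_address: int = 0      # START ADDRESS (logical block address)
    extent: int = 0             # EXTENT (number of logical blocks)
    next: Optional[int] = None  # NEXT: index of the following entry, or None


class BBMMap:
    def __init__(self, capacity: int) -> None:
        # Initially every entry is unused and chained onto the free list.
        self.table: List[BBMEntry] = [
            BBMEntry(next=i + 1 if i + 1 < capacity else None)
            for i in range(capacity)
        ]
        self.first_free: Optional[int] = 0 if capacity > 0 else None  # BBM FIRST FREE
        self.bbm_list: Optional[int] = None                           # BBM LIST
        self.dirty = False                                            # BBM DIRTY

    def allocate(self) -> Optional[int]:
        """Take an unused entry off the free list; None if the table is full."""
        idx = self.first_free
        if idx is None:
            return None
        self.first_free = self.table[idx].next
        self.table[idx].next = None
        return idx

    def release(self, idx: int) -> None:
        """Return an entry that is no longer needed to the free list."""
        self.table[idx] = BBMEntry(next=self.first_free)
        self.first_free = idx

    def chain(self, head: Optional[int]) -> List[int]:
        """Indices of the entries reachable from a list head."""
        out = []
        while head is not None:
            out.append(head)
            head = self.table[head].next
        return out


# Reproduce the state drawn in FIG. 4: two used entries, two unused entries.
bbm = BBMMap(capacity=4)
first, second = bbm.allocate(), bbm.allocate()
bbm.table[first] = BBMEntry(lun=0, start_address=100, extent=8, next=second)
bbm.table[second] = BBMEntry(lun=0, start_address=500, extent=16, next=None)
bbm.bbm_list = first
bbm.dirty = True
```

Keeping both lists inside one statically sized table means no memory is allocated after boot, at the cost of a fixed upper bound on the number of recorded regions.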

In the discussions which follow, the following notation for
the comparison of entries in the BBM TABLE 402 will be
employed to simplify the discussion. When used to compare
entries for which the value of the LUN field is identical,
START(X) will be taken to mean the value of an entry's
START ADDRESS field, where X refers to the entry in
question. Similarly, END(X) will be taken to mean the sum
of the entry's START ADDRESS and EXTENT fields minus
one, which is the last logical block address described by the
entry as invalid. In comparisons between entries describing
regions of different LUNs, it is defined that
END(Y)+1<START(Z) for any regions Y and Z for which the value of
entry Y's LUN field is less than the value of entry Z's LUN
field. In this way, the address spaces of all of the LUNs
represented by the controller unit are flattened into a single,
larger address space for purposes of ordering and comparison.
The notation NEXT(X) will be taken to mean the entry
referred to by entry X's NEXT field.
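
The notation above can be sketched as follows. The `Region` type and function names are hypothetical; the flattening of LUNs is modelled by comparing the LUN first and the block address second.

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    lun: int
    start_address: int
    extent: int


def start(x: Region) -> Tuple[int, int]:
    """START(X) as a position in the flattened address space."""
    return (x.lun, x.start_address)


def end(x: Region) -> Tuple[int, int]:
    """END(X): the last logical block address described by the entry."""
    return (x.lun, x.start_address + x.extent - 1)


def strictly_before(y: Region, z: Region) -> bool:
    """True when END(Y)+1 < START(Z) in the flattened address space."""
    if y.lun != z.lun:
        # Defined to hold whenever Y's LUN is the smaller one.
        return y.lun < z.lun
    return y.start_address + y.extent < z.start_address


a = Region(lun=0, start_address=100, extent=8)   # covers blocks 100..107
gap = Region(lun=0, start_address=109, extent=4)
touching = Region(lun=0, start_address=108, extent=4)
other_lun = Region(lun=1, start_address=0, extent=1)
```

Here `strictly_before(a, gap)` holds because block 108 separates the two regions, while `strictly_before(a, touching)` does not because the regions are contiguous.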

Since the maintenance of linked lists through insert and
delete operations and the use of free lists to dynamically
allocate elements from within a static structure are well
known in the art, these will not be further elaborated. In the
discussion which follows, operations which attempt to allocate
an unused entry from the BBM TABLE 402, returning
a failure code if none is available, and which return an entry
no longer needed to the pool of free entries will be assumed,
as will an operation to sequence the linked list structure prior
to saving to disk and an operation to restore the original
structure from an image on disk. The product of the sequencing
operation, to which are appended a timestamp and a marker
to indicate that the data is thus formatted, will be referred to
as a BBM STAMP. All of these processes are well known in
the art. Within the linked list BBM LIST 406 the additional
property is preserved that for each entry A which is followed
by entry B in the list, END(A)+1<START(B), which implies
that the entries are ordered, that they do not overlap and that
they are noncontiguous.
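
The sequencing and restoring operations, and the list property END(A)+1<START(B), can be sketched as below. The on-disk layout of a BBM STAMP is not given in the text, so the marker bytes, field widths, and byte order used here are assumptions chosen only for illustration.

```python
import struct
from typing import List, Optional, Tuple

Region = Tuple[int, int, int]      # (LUN, START ADDRESS, EXTENT)

MAGIC = b"BBMS"                    # hypothetical format marker
HEADER = struct.Struct("<4sQI")    # marker, timestamp, entry count
ENTRY = struct.Struct("<IQQ")      # LUN, START ADDRESS, EXTENT


def is_well_formed(regions: List[Region]) -> bool:
    """Check END(A)+1 < START(B) for every adjacent pair A, B."""
    for (la, sa, ea), (lb, sb, _) in zip(regions, regions[1:]):
        if la > lb or (la == lb and sa + ea >= sb):
            return False
    return True


def build_stamp(regions: List[Region], timestamp: int) -> bytes:
    """Sequence the list and attach the timestamp and marker."""
    if not is_well_formed(regions):
        raise ValueError("regions must be ordered, disjoint and noncontiguous")
    body = b"".join(ENTRY.pack(*r) for r in regions)
    return HEADER.pack(MAGIC, timestamp, len(regions)) + body


def parse_stamp(buf: bytes) -> Optional[Tuple[int, List[Region]]]:
    """Restore (timestamp, regions) from a disk image; None if invalid."""
    if len(buf) < HEADER.size:
        return None
    magic, timestamp, count = HEADER.unpack_from(buf, 0)
    if magic != MAGIC or len(buf) < HEADER.size + count * ENTRY.size:
        return None
    regions = [
        ENTRY.unpack_from(buf, HEADER.size + i * ENTRY.size)
        for i in range(count)
    ]
    return timestamp, regions


image = build_stamp([(0, 100, 8), (0, 500, 16), (1, 0, 4)], timestamp=1234)
restored = parse_stamp(image)
```

A reserved disk area that has never been written, for example one filled with zero bytes, fails the marker check and is reported as invalid rather than as an empty table.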

FIGS. 5a and 5b depict the read-modify-write process
known as blocking/deblocking, which a controller that
byte-stripes data onto a set of disks performs prior to certain
write operations that a host computer may initiate. Many
devices, including host adapter cards and disks, support only
a fixed size for logical blocks, which is the minimum unit of
data which can be transferred to or from the device. For
SCSI devices, this fixed size is frequently 512 bytes. Unless
the host computer supports a block-size which is a multiple
of the block-size supported by the storage units in the array
multiplied by the number of storage units across which the
data is to be striped, the possibility exists for the host
computer to request a write which alters only a fractional
portion of a disk block. In this case, the controller must read
the data which is recorded on the block in question prior to
performing the write in order to avoid corrupting the data
which shares the same blocks in the storage array with the
host data to be written.
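
The block-size condition above amounts to a simple alignment test, sketched here under the assumption that host writes are described by a byte offset and a byte length. The function names and the example figures (512-byte disk blocks striped over four data disks) are illustrative.

```python
def stripe_unit_bytes(disk_block_size: int, data_disks: int) -> int:
    """Smallest amount of host data that fills whole blocks on every disk."""
    return disk_block_size * data_disks


def needs_blocking_deblocking(offset: int, length: int,
                              disk_block_size: int, data_disks: int) -> bool:
    """True when a write would alter only a fraction of some disk block."""
    unit = stripe_unit_bytes(disk_block_size, data_disks)
    return offset % unit != 0 or length % unit != 0


# 512-byte disk blocks striped over four data disks: the unit is 2048 bytes.
unit = stripe_unit_bytes(512, 4)
partial = needs_blocking_deblocking(offset=1024, length=512,
                                    disk_block_size=512, data_disks=4)
aligned = needs_blocking_deblocking(offset=4096, length=2048,
                                    disk_block_size=512, data_disks=4)
```

A host that writes single 512-byte blocks to this array always triggers the read-modify-write path, since each host block occupies only a quarter of each of four disk blocks.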

US 6,243,827 B1

In FIG. 5a, a single block of data 502 is written by the host
computer to the controller and placed in the controller's
cache memory 504. Also shown in memory are three adjacent
blocks 508, 510, 512 which together with 502 map onto
the same set of four blocks 514, 516, 518, 520 on four
different devices within the storage array 506. Such blocks
with logical addresses preceding the host data to be written
508, 510 are referred to as a blocking/deblocking head (BDB
head), and such blocks with logical addresses subsequent to
the host data 512 are referred to as a blocking/deblocking tail
(BDB tail). Arrow 522 depicts the process of host data being
written into cache memory 504. Arrow 524 depicts the
blocking/deblocking read of byte-striped data from the storage
array into the cache memory 504. The portion of the data
which corresponds to the block of host data 502 is shown in
a dotted line to indicate that the data being transferred from
the storage array 506 is blocked from overwriting the host
data. In this way, the data from the host computer is merged
with the other data which share the same set of blocks within
the storage array 506. Arrow 526 depicts the combined data
being written back to the storage array 506.
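
The merge of FIG. 5a can be sketched in a few lines. The sketch assumes the simplest byte-striping rule, in which byte i of the combined logical data is stored on disk i modulo the number of disks; the helper names and the tiny 8-byte block size are hypothetical and chosen to keep the example readable.

```python
from typing import List


def stripe(data: bytes, disks: int) -> List[bytes]:
    """Byte-stripe logical data: byte i goes to disk i % disks."""
    return [data[i::disks] for i in range(disks)]


def unstripe(blocks: List[bytes]) -> bytes:
    """Inverse of stripe(): interleave the disk blocks back together."""
    disks = len(blocks)
    out = bytearray(sum(len(b) for b in blocks))
    for i, block in enumerate(blocks):
        out[i::disks] = block
    return bytes(out)


def blocking_deblocking_write(disk_blocks: List[bytes], host_block: bytes,
                              host_index: int) -> List[bytes]:
    """Read the stripe, overlay one host block, and re-stripe for writing."""
    merged = bytearray(unstripe(disk_blocks))           # arrow 524: read
    offset = host_index * len(host_block)
    merged[offset:offset + len(host_block)] = host_block  # host data wins
    return stripe(bytes(merged), len(disk_blocks))      # arrow 526: write back


BLOCK = 8                                   # toy block size for illustration
old_logical = bytes(range(4 * BLOCK))       # BDB head (2 blocks), host, BDB tail
on_disk = stripe(old_logical, disks=4)      # four disk blocks of BLOCK bytes
new_disk = blocking_deblocking_write(on_disk, b"\xff" * BLOCK, host_index=2)
new_logical = unstripe(new_disk)
```

Because every logical block contributes bytes to every disk block, a single unreadable disk block during this operation affects all four logical blocks, not only the one being written.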

FIG. 5b shows the effect of an unrecoverable physical
medium error on one block in the storage array during the
blocking/deblocking operation of FIG. 5a, assuming no
operational parity disk is available. In this case, the only
logically valid area of the blocking/deblocking region after
the operation is the data written by the host, since the
portions of blocks 508, 510 and 512 which were stored in
block 520 have been overwritten with invalid data and the
data from those blocks which resides on disk blocks 514,
516 and 518 are not sufficient to reconstruct the entirety of
the lost blocks.

FIG. 6 is a flowchart of the steps pertaining to bad block
management taken when the system boots. Block 602 is the
entry point for the routine. In block 604 the system allocates
memory for the BBM MAP structure and initializes it to
contain no bad regions. Variables used to find a valid stamp
in the disk array are initialized in block 604. The loop
control variable DISK is set to indicate the first disk in the
array, and a variable CANONICAL STAMP is set to a null
value. In block 608 a stamp containing the bad block table
is read from the disk referred to by DISK into a local buffer
BBM STAMP, which is checked for a valid stamp format in
block 610. If BBM STAMP is determined to be invalid,
control passes to block 616 where the presence of other disks
is checked. If there is another disk, it is assigned to DISK in
block 618 and the loop repeats. If there is no other disk,
control passes out of the loop to block 620. If BBM STAMP
is valid in block 610, its timestamp is checked against the
timestamp of CANONICAL STAMP (where the timestamp
for a null stamp is defined to be older than any valid
timestamp) in block 612. If it is more recent than
CANONICAL STAMP, CANONICAL STAMP is set to BBM
STAMP. Control then passes to block 616 for the next
iteration of the loop. When there are no more disks to check,
control passes to block 620, where CANONICAL STAMP is
checked for a null value. If it is not null, then an appropriate
stamp has been located, and all the bad regions described in
CANONICAL STAMP are added to the BBM MAP in block
622. This portion of the boot-up process is completed in
block 624.
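
The boot-time scan of FIG. 6 may be sketched as follows. This is only an illustration under stated assumptions, not the controller's firmware: the `Stamp` record, the `read_stamp` callback, and the use of `None` to model an invalid or unreadable stamp are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Region = Tuple[int, int]  # (first block, last block); hypothetical layout


@dataclass
class Stamp:
    """Hypothetical in-memory form of a BBM STAMP read from one disk."""
    timestamp: int
    regions: List[Region] = field(default_factory=list)


def load_bbm_map(num_disks: int,
                 read_stamp: Callable[[int], Optional[Stamp]]) -> List[Region]:
    """Return the bad regions of the most recent valid stamp, or an empty map."""
    bbm_map: List[Region] = []          # block 604: map starts with no bad regions
    canonical: Optional[Stamp] = None   # CANONICAL STAMP starts as null
    for disk in range(num_disks):       # blocks 616/618: visit every disk
        stamp = read_stamp(disk)        # block 608; None models an invalid stamp
        if stamp is None:               # block 610: skip invalid stamps
            continue
        # Block 612: a null stamp counts as older than any valid timestamp.
        if canonical is None or stamp.timestamp > canonical.timestamp:
            canonical = stamp
    if canonical is not None:           # block 620
        bbm_map.extend(canonical.regions)   # block 622
    return bbm_map
```

Because every disk carries a replica of the table, the scan tolerates stale or unreadable copies on some disks: only the newest valid stamp is used.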

FIG. 7 is a flowchart of the process CHECK BBM
FLUSH, which is performed periodically by the system tasks
running on the CPU such that the process is guaranteed to be
performed by a system task which alters the BBM MAP
structure soon after that alteration is made. The process
starts in block 702. In block 704, the BBM DIRTY bit of the
BBM MAP structure is checked. If it is not set, the process
completes in block 712. If the bit is set, a valid BBM
STAMP is built in the controller's local memory containing
the timestamp for the current time and all of the entries from
the BBM TABLE which are linked in the BBM LIST in
block 706. In block 708, write operations of the BBM
STAMP onto a specific, reserved address of each disk in the
array are initiated. The BBM DIRTY bit is then cleared in
block 710 to indicate that the alterations to the map have
been saved. The process then completes in block 712.
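
A minimal sketch of the CHECK BBM FLUSH logic of FIG. 7 is given below. The `BbmMap` class, the dictionary used as a stamp, and the `write_stamp` callback standing in for the write to the reserved address of each disk are assumptions made for illustration only.

```python
import time
from typing import Callable, List, Tuple

Region = Tuple[int, int]  # (first block, last block); hypothetical layout


class BbmMap:
    """Hypothetical stand-in for the BBM MAP structure."""

    def __init__(self) -> None:
        self.regions: List[Region] = []   # entries linked in the BBM LIST
        self.dirty = False                # the BBM DIRTY bit


def check_bbm_flush(bbm: BbmMap, num_disks: int,
                    write_stamp: Callable[[int, dict], None],
                    now: Callable[[], float] = time.time) -> bool:
    """Write a fresh stamp to every disk if the map is dirty; report whether it did."""
    if not bbm.dirty:                     # block 704: nothing to save
        return False
    # Block 706: build a stamp holding the current time and all listed regions.
    stamp = {"timestamp": now(), "regions": list(bbm.regions)}
    for disk in range(num_disks):         # block 708: one write per disk
        write_stamp(disk, stamp)
    bbm.dirty = False                     # block 710: alterations are saved
    return True
```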

FIG. 8 is a flowchart of the process CHECK TRANSFER 
which is performed prior to any operation which accesses or 
alters data stored in the array. The process starts in block 
802. In block 804, the direction of the operation is checked. 
If the operation writes data to the storage array, control is 
transferred to block 812, and the process completes. 

For read operations, control is transferred to block 806, 
and the source of the command for the data transfer is 
checked. If the command originated from the host computer, 
control passes to block 808. Otherwise, the command is 
internally generated by the controller (such as the recon- 
struction of a disk or parity verification of a healthy array), 
and control passes to block 814, where the subroutine 
CHECK RANGE is called for the range of logical addresses 
described by the command. If no invalid blocks are present 
in the range, control is transferred to block 812, and the 
process completes. If there are invalid blocks in the range, 
control passes to block 834, where the region of the con- 
troller's cache memory which would receive the remainder 
of the data starting from the first invalid block is marked as 
invalid in order to prevent the invalid data from being 
transferred to the host as a cache hit. A means for selective 
invalidation of cache blocks is assumed, since the manage- 
ment of cache memory, including the invalidation of specific 
regions of a cache, is outside of the scope of the present 
invention. After cache invalidation is performed, control 
passes to block 812, and the process completes. 

For read operations originating from the host computer, 
control passes to block 808, where it is checked whether the 
operation in question is a blocking/deblocking read opera- 
tion to be performed before a host write. If so, control is 
transferred to the subroutine CHECK BDB in block 810, 
where the blocks loaded during the blocking/deblocking 
process are individually checked for validity and invalidated 
in cache if necessary. Control then passes to block 812, and 
the process completes. 

For non-blocking/deblocking read operations originating 
from the host computer, control passes to block 819, where 
the subroutine CHECK BDB is called for the blocking/ 
deblocking head, if any, preceding the requested host data to 
prevent any invalid data from being transferred to the host 
as a cache hit. In block 820, the subroutine CHECK RANGE 
is called for the range of logical addresses to be read from 
the storage array. This range may include a blocking/ 
deblocking tail which will be loaded into memory along 
with the requested data. It may also include a lookahead, a 
number of blocks to be loaded sequentially after those 
blocks which the host requested. The technique of lookahead 
is well known in the art to enhance performance for hosts 
which are likely to access data from a storage system 
sequentially. In block 822 it is checked whether any of the 
data described by the command contains invalid blocks. If 
not, control passes to block 812, and the process completes. 
If there are invalid blocks in the range, control passes 
through connector 824 to block 828, where it is checked 
whether any of the data requested by the host is invalid. If 
not, control is transferred to block 826, and all of the 
requested data is transferred to the host, after which the 
invalid portion of the command is invalidated in the cache 
memory to prevent the invalid data from being transferred to
the host as a cache hit.

If the invalid blocks are determined to fall within the
portion of the command requested by the host, control is
transferred to block 836, where it is checked whether the
first block of data requested by the host is registered as
invalid. If so, control passes to block 838, where the entire
command is invalidated in the cache memory to prevent the
invalid data from being transferred to the host as a cache hit.
A MEDIUM ERROR status is then sent to the host in block
840, indicating the requested address as the location of the
block in error. If some of the data requested by the host is
valid, control passes to block 830 where all of the valid data
requested is transferred to the host. In block 832, a
MEDIUM ERROR status is then sent to the host indicating
the first invalid address as the location of the block in error.
Control then passes to block 812, and the process completes.

FIG. 9 is a flowchart of the subroutine CHECK BDB,
which checks for invalid blocks loaded during a blocking/
deblocking read operation. The process begins in block 902.
In block 904, the variable BLOCK is set to the first logical
block address which will be loaded by the operation. In
block 906, the subroutine CHECK RANGE is called for the
range of a single block at the address BLOCK. If the block
is labeled as invalid in block 908, it is invalidated in cache
in block 914. In block 910, BLOCK is set to the next block
which will be loaded by the blocking/deblocking operation.
If there are more blocks to load in block 912, control passes
to block 906 for the next iteration of the loop. Otherwise,
control passes to block 916, and the subroutine completes.

FIG. 10 is a flowchart of the process which is performed
when a write operation to the storage array has been
completed, successfully or unsuccessfully, by the controller.
In the case of a failure, this process will only be called after
the number of retries prescribed by the error-handling
policies of the controller have been performed. If the array has
redundancy and errors occur on only one disk, the disk may
be removed from operation prior to the invocation of this
process, in which case the status of the write operation will
be considered to be good. The process begins in block 1002.
In block 1004, the status of the operation is checked. If the
data have been successfully committed to all of the storage
units involved, control passes to block 1012, where the
subroutine DELETE BAD BLOCKS is invoked for the
range of logical addresses described by the host command
(not including blocking/deblocking heads or tails),
indicating that the range of addresses contain valid host data. The
subroutine then completes in block 1010. If the data was not
successfully committed to the array even after retries and/or
the removal of a redundant disk from operation, control
passes to block 1006, where the subroutine ADD BAD
BLOCKS is called for the range of logical addresses
described by the command (including blocking/deblocking
heads or tails). If the operation is determined to be a
writethrough operation in block 1008, sense indicating the
failure of the write is sent to the host in block 1014. If the
operation is a write-back operation, the host cannot be
immediately informed of the error, so control passes to block
1010 and the process completes.

FIG. 11 is a flowchart of the process which is performed
when a blocking/deblocking read operation completes with
a failure status after the prescribed number of retries. By
invalidating the blocking/deblocking data rather than
aborting the associated write command issued by the host
computer, this routine favors data from the host over the data
sharing the same physical blocks in the storage array. This
policy is based on the fact that the write data from the host
is known to contain data which the host uses, whereas the
blocking/deblocking data may be in unused space. The
process begins in block 1102. In block 1104, BLOCK is set
to the logical address of the first logical block which would
have been loaded by the blocking/deblocking operation. A
call is made to the subroutine ADD BAD BLOCKS in block
1106 to add that block to the BBM MAP. In block 1108,
BLOCK is set to the logical address of the next block which
would have been loaded, and if there is such a block, control
passes through block 1110 to block 1106 for another
iteration of the loop. When there are no more blocks, control
passes to block 1112, and the process completes.

FIG. 12 is a flowchart of the process which is performed
when a storage unit reports a MEDIUM ERROR status on
a read. If the array is not in a redundant state, this process
will be called in the case of failure only after the prescribed
number of retries have expired. Since the DMA sync
hardware automatically rebuilds the lost data into cache from
parity without any additional read or parity-building
operations, no retries need be performed when the array is
in a redundant state. Additionally, the block containing
physically flawed media may be remapped to a different
location, either automatically by the storage device or using
the SCSI REASSIGN BLOCKS (0x07) command. The
process begins in block 1202. In block 1204, it is determined
what range of logical addresses map into the disk block for
which the medium error occurred. If the data is determined
to have been successfully reconstructed by the DMA sync
hardware in block 1206, writeback bits are set in the cache
node descriptor corresponding to the logical address range
where the error occurred and a writeback flush operation is
queued for the cache node in order to write back the
corrected data to the storage unit which reported the failure.
If the data was not reconstructed, control passes to block
1210 where the subroutine ADD BAD BLOCKS is called to
register the affected blocks as invalid in the BBM MAP. If
the operation is determined to be a reconstruct in block
1212, it is given a RETRY status in block 1214. Since
reconstruction often crosses large extents which have never
been used by the host computer and which are likely to
contain medium errors since they have not recently been
written, and since the array can only be restored to
redundancy by the completion of a reconstruction, it is desirable
to allow many such reassignments to occur before aborting
the operation. If the operation is determined to be a
blocking/deblocking read in block 1212, control passes to
block 1220 and the error, which has been documented in the
BBM MAP, is ignored, allowing the associated host write to
proceed. Other operations are given an ABORT status in
block 1216, and the steps prescribed by the controller's
error-handling policy for aborted commands are taken,
including the sending of sense describing the error to the
host, when appropriate.

FIG. 13 is a flowchart of the CHECK RANGE subroutine,
which determines whether any logical blocks within a given
range on a given LUN are listed as invalid in the BBM MAP
and, if so, returns the address of the first invalid block in the
range. The subroutine begins in block 1302. In block 1304,
the BBM LIST is assigned to the local pointer variable
REGION. If REGION is null, the range to be checked does
not overlap any bad blocks and the subroutine returns null in
block 1308. If REGION is non-null, control passes to block
1310 where the first block of the invalid region designated
by REGION is compared to the last block of the region to
be checked, designated by XFR_REGION. There can be no
overlap if START(REGION)>END(XFR_REGION), in
which case control passes to block 1312, where the pointer
REGION is set to the next entry in the BBM LIST, and then
to block 1306 for another iteration of the loop. If
START(REGION)<=END(XFR_REGION), there is a possibility
of overlap, so the last block of REGION is compared with
the first block of XFR_REGION in block 1314. If
END(REGION)>=START(XFR_REGION), the two regions
overlap, so the greater of START(REGION) and
START(XFR_REGION) is returned by the subroutine in block
1316 as the first invalid block in the given range. Otherwise,
there is no overlap, and control passes to block 1312 to
examine the next entry in the list.
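
The overlap test of the CHECK RANGE subroutine may be illustrated by the sketch below. It assumes, for the purpose of the example only, that the BBM LIST is held as a Python list of inclusive `(start, end)` pairs; the function and parameter names are hypothetical. The sketch follows the text in advancing to the next entry whenever an entry does not overlap the range being checked.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # inclusive (START, END) logical block addresses


def check_range(bbm_list: List[Region], xfr_region: Region) -> Optional[int]:
    """Return the first invalid block inside xfr_region, or None if there is none."""
    xfr_start, xfr_end = xfr_region
    for start, end in bbm_list:            # blocks 1304/1312: walk the list
        if start > xfr_end:                # block 1310: entry lies past the range
            continue
        if end >= xfr_start:               # block 1314: the two regions overlap
            return max(start, xfr_start)   # block 1316: first invalid block
    return None                            # block 1308: list exhausted


# With bad regions 10-19 and 40-44, a transfer of blocks 15-50
# first meets invalid data at block 15.
print(check_range([(10, 19), (40, 44)], (15, 50)))
```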

FIG. 14 is a flowchart of the subroutine ADD BAD
BLOCKS, which is called when a region of the array is
determined to be invalid. The subroutine begins in block
1402 of FIG. 14a. In block 1404, the bit BBM DIRTY is set
to indicate that a write of the BBM MAP to disk will need
to occur. In block 1406, the BBM LIST is assigned to the
local pointer variable REGION. If REGION is null in block
1408, the range does not precede or overlap any entries in
the BBM LIST and is inserted at the end of the list in block
1410. The subroutine then completes in block 1412. If
REGION is non-null, the first block of REGION is
compared to the block after the last block of the region to be
added, designated as NEW_REGION, in block 1414. If
START(REGION)>1+END(NEW_REGION), then
NEW_REGION precedes REGION, and the two do not
need to be merged into a single descriptor. Control passes to
block 1416, where the descriptor for NEW_REGION is
inserted in front of REGION in the BBM LIST. The
subroutine then completes in block 1412. Otherwise, the block
after the last block of REGION is compared with the first
block of NEW_REGION. If END(REGION)+1<START(NEW_REGION),
then REGION precedes NEW_REGION and the two do not
need to be merged into a single descriptor. Control then
passes to block 1420, where REGION is set to the next entry
in the BBM LIST, and then to block 1408 for the next
iteration of the loop.

In the case where descriptors need to be merged, control
passes through connector 1422 to block 1424 of FIG. 14b,
where a new local pointer variable, MERGE_REGION,
which is used to determine whether any entries in the BBM
TABLE are entirely contained within NEW_REGION and
must be deleted, is initialized to point to the next entry after
REGION in the BBM LIST. Because the list is ordered,
MERGE_REGION is known to start after the first block of
NEW_REGION. In block 1426, the first block of
MERGE_REGION is compared to the block after the last
block of NEW_REGION. If START(MERGE_REGION)
<=1+END(NEW_REGION), MERGE_REGION is
entirely contained within NEW_REGION and is merged
with REGION in block 1432. The descriptor of
MERGE_REGION can then be returned to the list of free
descriptors in block 1434, and the pointer is advanced to the
next entry in the list in block 1436. This process repeats until
MERGE_REGION is determined to be disjoint from
NEW_REGION in block 1426.

When all descriptors contained within NEW_REGION
have been merged, control passes to block 1428, where the
first block of REGION is set to be the lesser of the first block
of REGION and the first block of NEW_REGION. In block
1430, the last block of REGION is set to be the greater of
the last block of REGION, which in the loop in the
preceding paragraph will have been set to the last block of the last
region to be merged, and the last block of NEW_REGION.
The subroutine then completes in block 1412.
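
The net effect of ADD BAD BLOCKS on an ordered list may be sketched as below. This is a simplified restatement, not the descriptor-by-descriptor flow of FIGS. 14a and 14b: it assumes the BBM LIST is a sorted Python list of inclusive `(start, end)` pairs, returns a new list instead of relinking descriptors, and leaves the setting of the BBM DIRTY bit to the caller. All names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # inclusive (START, END) logical block addresses


def add_bad_blocks(bbm_list: List[Region], new_region: Region) -> List[Region]:
    """Return a sorted list in which new_region is merged with any
    entries that overlap it or sit immediately next to it."""
    new_start, new_end = new_region
    result: List[Region] = []
    i = 0
    # Entries with END(REGION)+1 < START(NEW_REGION) precede the new region
    # and are kept unchanged.
    while i < len(bbm_list) and bbm_list[i][1] + 1 < new_start:
        result.append(bbm_list[i])
        i += 1
    # Entries with START(REGION) <= 1+END(NEW_REGION) overlap or adjoin it;
    # the merged region takes the lesser start and the greater end.
    while i < len(bbm_list) and bbm_list[i][0] <= new_end + 1:
        new_start = min(new_start, bbm_list[i][0])
        new_end = max(new_end, bbm_list[i][1])
        i += 1
    result.append((new_start, new_end))
    result.extend(bbm_list[i:])   # remaining entries lie after the new region
    return result
```

Merging adjacent regions as well as overlapping ones keeps the number of descriptors small, which matters when the descriptor pool is finite.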

FIG. 15 is a flowchart of the subroutine DELETE BAD
BLOCKS, which is called when a write operation
successfully commits data to the storage array. The subroutine
begins in block 1502 of FIG. 15a. In block 1504, the BBM
LIST is assigned to the local pointer variable REGION. If
REGION is null in block 1506, the range of blocks is not
listed in the BBM MAP, so the subroutine completes in
block 1508. If REGION is non-null, but in block 1510 the
first block of REGION succeeds the last block of the region
to be deleted, designated as XFR_REGION, there can be no
more entries in the list to be deleted, and the subroutine
completes in block 1508. Otherwise, the last block of
REGION is then compared to the first block of the region to
be deleted in block 1512. If END(REGION)<START(XFR_REGION),
the regions are disjoint, and control
passes to block 1518, where REGION is set to the next entry
in the BBM LIST, and then to block 1506 for the next
iteration of the loop. Otherwise, there is overlap, and part or
all of REGION must be deleted. In this case, the BBM
DIRTY bit is set in block 1514 to indicate that the BBM
MAP will need to be saved to disk and control passes
through connector 1516 to block 1522.

The flowchart of FIG. 15b shows the various cases of
deletion of the portions of REGION which overlap
XFR_REGION. In block 1522, it is checked whether
REGION both ends before the end of XFR_REGION and
begins after the beginning of XFR_REGION. In this case,
REGION is entirely contained within XFR_REGION, and
its descriptor is deallocated in block 1524. Control then
passes through connector 1520 back to the main loop to
check the next entry in the BBM LIST. In block 1526, it is
checked whether REGION ends before the end of
XFR_REGION but starts before XFR_REGION. In this
case, there are blocks at the head of REGION which are still
invalid, so the end of REGION is set to the block before the
first block of XFR_REGION. Control then passes through
connector 1520 back to the main loop to check the next entry
in the BBM LIST. In block 1530, it is checked whether
REGION begins after the beginning of XFR_REGION but
ends after the last block of XFR_REGION. In this case,
there are blocks at the tail of REGION which are still
invalid, so the start of REGION is set to the block after the
last block of XFR_REGION.

If none of the above are the case, XFR_REGION must be
contained within REGION. In this case the descriptor of
REGION should be broken in two. A new descriptor is
allocated in block 1534. If no descriptor is available in block
1536, REGION is truncated to end at the block before
XFR_REGION in block 1542. If the allocation is
successful, NEW_REGION is set to begin from the block
after the end of XFR_REGION and end at the end of
REGION in block 1538. NEW_REGION is then inserted in
the list after REGION in block 1540. REGION is then
truncated to end at the block before XFR_REGION in block
1542. The subroutine then completes in block 1508.
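
The four deletion cases of DELETE BAD BLOCKS (whole entry removed, head kept, tail kept, entry split in two) may be sketched as follows. The sketch assumes a sorted Python list of inclusive `(start, end)` pairs and hypothetical names. It does not model the finite descriptor pool, so the fallback taken when no descriptor can be allocated is not shown, and the setting of the BBM DIRTY bit is left to the caller.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # inclusive (START, END) logical block addresses


def delete_bad_blocks(bbm_list: List[Region], xfr_region: Region) -> List[Region]:
    """Return the list with every block of xfr_region removed from the bad regions."""
    xfr_start, xfr_end = xfr_region
    result: List[Region] = []
    for start, end in bbm_list:
        if end < xfr_start or start > xfr_end:
            result.append((start, end))            # disjoint: entry is untouched
            continue
        if start < xfr_start:
            result.append((start, xfr_start - 1))  # head of REGION is still invalid
        if end > xfr_end:
            result.append((xfr_end + 1, end))      # tail of REGION is still invalid
        # If neither test fires, the entry lies wholly inside xfr_region and is
        # dropped; if both fire, the entry has been split into two regions.
    return result
```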

It will be apparent to those skilled in the art that the 
examples and embodiments described herein are by way of 
illustration and not of limitation, and that other examples 
may be used without departing from the spirit and scope of 
the present invention, as set forth in the claims. 

I claim: 

1. A method for designating physically or logically invalid
regions of storage units in a fault-tolerant storage device
array comprising a plurality of failure independent storage
units for storing information which receive information from
a writeback-cache, and a controller having a
writeback-cache comprising the steps:

a. determining the logical address and length of each
physically or logically invalid region,

b. writing the logical address and length of each
physically or logically invalid region on a bad region table,
and

c. replicating the bad region table on two or more but less
than all of the storage units.

2. A method for designating physically or logically invalid
regions of storage units in a fault-tolerant storage device
array comprising a plurality of failure independent storage
units for storing information which receive information from
a writeback-cache, and a controller having a
writeback-cache comprising the steps:

a. determining the logical address and length of each
physically or logically invalid region,

b. writing the logical address and length of each
physically or logically invalid region on a bad region table,
and

c. replicating the bad region table on stable storage units
separate from the array of storage units.

3. A method for designating as invalid either a whole or
fractional number of blocks or regions on a plurality of
storage units across which data has been striped after
physical error or corruption on a storage unit or storage units
in the plurality have occurred in a fault-tolerant storage
device array comprising a plurality of failure independent
storage units for storing information which receive
information from a writeback cache, and a controller having a
writeback-cache comprising the steps:

a. determining the physical address and length of each
block or region of physical error or corruption,

b. determining the set of logical blocks which map onto
the region of physical corruption,

c. determining the subset of the logical blocks from step
b which are made logically invalid due to the physical
error or corruption, and

d. replicating the bad region table on two or more but less
than all of the storage units.

4. A method for designating as invalid either a whole or 
fractional number of blocks or regions on a plurality of 
storage units across which data has been striped after 
physical error or corruption on a storage unit or storage units 
in the plurality have occurred in a fault-tolerant storage
device array comprising a plurality of failure independent 
storage units for storing information which receive infor- 
mation from a writeback cache, and a controller having a 
writeback-cache comprising the steps: 

a. determining the physical address and length of each 
block or region of physical error or corruption, 

b. determining the set of logical blocks which map onto 
the region of physical corruption, 

c. determining the subset of the logical blocks from step 
b which are made logically invalid due to the physical 
error or corruption, 

d. recording the logical address and length of all logically 
invalid ranges of blocks, and 

e. replicating the bad region table on stable storage units 
separate from the array of storage units. 

***** 
(57) ABSTRACT

A system and technique for detecting and classifying data
errors in a memory device. More specifically, hard and soft
data errors in a memory device are detected by issuing a
READ request from a host controller. If an error is
detected, the data is corrected and re-written to the
corresponding memory location. A second READ request is then
issued to read the corrected data. If a second error is detected
in the corrected data, the error is classified as a hard error
and a counter is incremented to track the number of hard
errors detected in the system. Once a programmable
threshold number of hard errors are detected in a particular
memory segment, an indicator, such as a light emitting diode
(LED), is used to indicate that the corresponding memory
segment should be replaced.
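
The read, correct, re-write and re-read sequence summarized above may be sketched as follows. This is only an illustration: the `read`, `correct` and `write` callbacks, the per-segment counter dictionary, and the returned status strings are hypothetical names that do not come from the disclosure.

```python
from typing import Callable, Dict, Tuple

ReadFn = Callable[[int], Tuple[int, bool]]   # returns (data, error_detected)


def check_location(address: int,
                   read: ReadFn,
                   correct: Callable[[int], int],
                   write: Callable[[int, int], None],
                   hard_errors: Dict[str, int],
                   segment: str,
                   threshold: int) -> str:
    """Classify the state of one memory location as ok, soft, hard, or replace."""
    data, error = read(address)              # first READ request
    if not error:
        return "ok"
    write(address, correct(data))            # correct the data and re-write it
    _, error_again = read(address)           # second READ of the corrected data
    if not error_again:
        return "soft"                        # the re-write cleared the error
    # The error persists after correction, so it is counted as a hard error.
    hard_errors[segment] = hard_errors.get(segment, 0) + 1
    if hard_errors[segment] >= threshold:
        return "replace"                     # e.g. light the segment's indicator
    return "hard"
```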
HARD/SOFT ERROR DETECTION

CROSS-REFERENCE TO RELATED 
APPLICATION 

[0001] The present application is a continuation of
application Ser. No. 09/769,958 filed on Jan. 25, 2001 which
claims priority under 35 U.S.C. §119(e) to provisional
application 60/178,108 filed on Jan. 26, 2000.

BACKGROUND OF THE INVENTION 
[0002] 1. Field of the Invention 

[0003] The present invention relates generally to memory 
protection, and more specifically to a technique for detecting 
errors in a memory device. 

[0004] 2. Description of the Related Art 

[0005] This section is intended to introduce the reader to 
various aspects of art which may be related to various 
aspects of the present invention which are described and/or 
claimed below. This discussion is believed to be helpful in 
providing the reader with background information to facili- 
tate a better understanding of the various aspects of the 
present invention. Accordingly, it should be understood that 
these statements are to be read in this light, and not as 
admissions of prior art. 

[0006] Semiconductor memory devices used in computer 
systems, such as dynamic random access memory (DRAM) 
devices, generally comprise a large number of capacitors 
which store binary data in each memory device in the form 
of a charge. These capacitors are inherently susceptible to 
errors. As memory devices get smaller and smaller, the 
capacitors used to store the charges also become smaller 
thereby providing a greater potential for errors. 

[0007] Memory errors are generally classified as "hard 
errors" or "soft errors." Hard errors are generally caused by 
issues such as poor solder joints, connector errors, and faulty 
capacitors in the memory device. Hard errors are reoccurring 
errors which generally require some type of hardware cor- 
rection such as replacement of a connector or memory 
device. Soft errors, which cause the vast majority of errors 
in semiconductor memory, are transient events wherein 
extraneous charged particles cause a change in the charge 
stored in one of the capacitors in the memory device. When 
a charged particle, such as those present in cosmic rays, 
comes in contact with the memory circuit, the particle may 
change the charge of one or more memory cells, without 
actually damaging the device. Because these soft errors are 
transient events, generally caused by alpha particles or 
cosmic rays for example, the errors are not generally repeat- 
able and are generally related to erroneous charge storage 
rather than hardware errors. For this reason, soft errors, if 
detected, may be corrected by rewriting the erroneous 
memory cell with correct data. Uncorrected soft errors will 
generally result in unnecessary system failures. Further, soft 
errors may be mistaken for more serious system errors and 
may lead to the unnecessary replacement of a memory 
device. By identifying soft errors in a memory device, the
number of physically error-free memory devices that are
replaced because of mistaken error detection can be
reduced, and the errors may be easily corrected
before any system failures occur.



[0008] Memory errors can be categorized as either
single-bit or multi-bit errors. A single-bit error refers to an error in
a single memory cell. Single-bit errors can be detected and
corrected by standard Error Code Correction (ECC) meth- 
ods. However, in the case of multi-bit errors, which affect 
more than one bit, standard ECC methods may not be 
sufficient. In some instances, ECC methods may be able to 
detect multi-bit errors, but not correct them. In other 
instances, ECC methods may not even be sufficient to detect 
the error. Thus, multi-bit errors must be detected and cor- 
rected by a more complex means since a system failure will 
typically result if the multi-bit errors are not detected and 
corrected. 
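
The difference between correcting a single-bit error and merely detecting a multi-bit error can be illustrated with a textbook extended Hamming code. The disclosure does not say which code the ECC logic uses, so the 8-bit codeword protecting 4 data bits below is an assumption chosen only because it is small enough to follow by hand; the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as 8 bits: Hamming(7,4) in positions 1-7,
    plus an overall parity bit in position 0."""
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(received: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error cannot be corrected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):                 # XOR of the positions of all set bits
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2                 # 0 for a valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                 # single-bit error: flip it back
        status = "corrected"                # (syndrome 0 means bit 0 itself)
    else:
        return "uncorrectable", None        # two bits flipped: detected only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

With this code one flipped bit is repaired and two flipped bits are reported as uncorrectable. Three flipped bits can look like a single-bit error and be "corrected" to the wrong value, which is the kind of undetected multi-bit error the paragraph above warns about.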

[0009] Regardless of the classification of memory error 
(hard/soft, single-bit/multi-bit), the current techniques for 
detecting the memory errors have several drawbacks. Typi- 
cal error detection techniques rely on READ commands 
being issued by requesting devices, such as a peripheral disk 
drive. Once a READ command is issued to a memory sector, 
a copy of the data is read from the memory sector and tested 
for errors en route to delivery to the requesting device. 
Because the testing of the data in a memory sector only 
occurs if a READ command is issued to that sector, seldom 
accessed sectors may remain untested indefinitely. Harmless 
single-bit errors may align over time resulting in uncorrect- 
able multi-bit errors. Once a READ request is finally issued 
to a seldom accessed sector, previously correctable errors 
may have evolved into uncorrectable errors thereby causing 
unnecessary data corruption or system failures. Early error 
detection may significantly reduce the occurrences of uncor- 
rectable errors and prevent future system failures. 

[0010] Further, in redundant memory systems, undetected 
memory errors may pose an additional threat. Certain opera- 
tions, such as hot-plug events, may require that the system 
transition from a redundant to a non-redundant state. In a 
non-redundant state, memory errors which were of little 
concern during a redundant mode of operation, may become 
more significant since errors that were correctable during a 
redundant mode of operation may no longer be correctable 
while the system operates in a non-redundant state. 

[0011] The present invention may address one or more of
the concerns set forth above. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0012] The foregoing and other advantages of the inven- 
tion will become apparent upon reading the following 
detailed description and upon reference to the drawings in 
which: 

[0013] FIG. 1 is a block diagram illustrating an exemplary 
computer system; 

[0014] FIG. 2 illustrates an exemplary memory device 
used in the present system; 

[0015] FIG. 3 generally illustrates a cache line and 
memory controller configuration in accordance with the 
present technique; 

[0016] FIG. 4 generally illustrates the implementation of 
a RAID memory system to recreate erroneous data words; 

[0017] FIG. 5 illustrates an exemplary memory sub-sys- 
tem in accordance with the present technique; and 
[0018] FIG. 6 is a block diagram illustrating an exemplary
architecture associated with a computer system in accor- 
dance with the present technique. 

DETAILED DESCRIPTION OF SPECIFIC 
EMBODIMENTS 
[0019] One or more specific embodiments of the present 
invention will be described below. In an effort to provide a 
concise description of these embodiments, not all features of 
an actual implementation are described in the specification. 
It should be appreciated that in the development of any such 
actual implementation, as in any engineering or design 
project, numerous implementation-specific decisions must 
be made to achieve the developers' specific goals, such as 
compliance with system-related and business-related con- 
straints, which may vary from one implementation to 
another. Moreover, it should be appreciated that such a 
development effort might be complex and time consuming, 
but would nevertheless be a routine undertaking of design, 
fabrication, and manufacture for those of ordinary skill 
having the benefit of this disclosure. 

[0020] Turning now to the drawings, and referring initially 
to FIG. 1, a multiprocessor computer system, for example 
a Proliant 8500 PCI-X from Compaq Computer Corpora- 
tion, is illustrated and designated by the reference numeral 
10. In this embodiment of the system 10, multiple processors 
11 control many of the functions of the system 10. The 
processors 11 may be, for example, Pentium, Pentium Pro, 
Pentium II Xeon (Slot-2), or Pentium III processors available
from Intel Corporation. However, it should be understood
that the number and type of processors are not critical
to the technique described herein and are merely being 
provided by way of example. 

[0021] Typically, the processors 11 are coupled to a pro- 
cessor bus 12. As instructions are sent and received by the 
processors 11, the processor bus 12 transmits the instruc- 
tions and data between the individual processors 11 and a 
host controller 13. The host controller 13 serves as an 
interface directing signals between the processors 11, cache
accelerators 14, a memory controller 15 (which may be 
comprised of one or more memory control devices as 
discussed with reference to FIGS. 5 and 6), and an I/O 
controller 19. Generally, ASICs are located within the host 
controller 13. The host controller 13 may include address 
and data buffers, as well as arbitration and bus master 
control logic. The host controller 13 may also include 
miscellaneous logic, such as error detection and correction 
logic. Furthermore, the ASICs in the host controller may 
also contain logic specifying ordering rules, buffer alloca- 
tion, specifying transaction type, and logic for receiving and 
delivering data. When the data is retrieved from the memory 
16, the instructions are sent from the memory controller 15 
via a memory bus 17. The memory controller 15 may 
comprise one or more suitable standard memory control 
devices or ASICs. 

[0022] The memory 16 in the system 10 is generally 
divided into groups of bytes called cache lines. Bytes in a 
cache line may comprise several variable values. Cache lines 
in the memory 16 are moved to a cache for use by the 
processors 11 when the processors 11 request data stored in 
that particular cache line. 

[0023] The host controller 13 is coupled to the memory 
controller 15 via a memory network bus 18. As mentioned
above, the host controller 13 directs data to and from the
processors 11 through the processor bus 12, to and from the
memory controller 15 through the memory network bus 18,
and to and from the cache accelerator 14. In addition, data
may be sent to and from the I/O controller 19 for use by 
other systems or external devices. The I/O controller 19 may 
comprise a plurality of PCI-bridges, for example, and may 
include counters and timers as conventionally present in 
personal computer systems, an interrupt controller for both 
the memory network and I/O buses, and power management 
logic. Further, the I/O controller 19 is coupled to multiple 
I/O buses 20. Finally, each I/O bus 20 terminates at a series 
of slots or I/O interface 21. 

[0024] Generally, a transaction is initiated by a requester, 
e.g., a peripheral device, via the I/O interface 21. The 
transaction is then sent to one of the I/O buses 20 depending 
on the peripheral device utilized and the location of the I/O 
interface 21. The transaction is then directed towards the I/O 
controller 19. Logic devices within the I/O controller 19
generally allocate a buffer where data returned from the 
memory 16 may be stored. Once the buffer is allocated, the 
transaction request is directed towards the processor 11 and 
then to the memory 16. Once the requested data is returned 
from the memory 16, the data is stored within a buffer in the 
I/O controller 19. The logic devices within the I/O controller 
19 operate to read and deliver the data to the requesting 
peripheral device such as a tape drive, CD-ROM device or 
other storage device. 

[0025] A system 10, such as a computer system, generally 
comprises a plurality of memory modules, such as Dual 
Inline Memory Modules (DIMMs). A standard DIMM may 
include a plurality of memory devices such as Dynamic 
Random Access Memory Devices (DRAMs). In an exem- 
plary configuration, a DIMM may comprise nine memory 
devices on each side of the DIMM 22. FIG. 2 illustrates one 
side of a DIMM 22 which includes nine DRAMs 23. The 
second side of the DIMM 22 may be identical to the first side 
and may comprise nine additional DRAM devices (not 
shown). Each DIMM 22 access generally accesses all 
DRAMs 23 on the DIMM 22 to produce a data word. For 
example, a DIMM 22 comprising x4 DRAMs 23 (DRAMs 
passing 4-bits with each access) will produce 72-bit data 
words. System memory is generally accessed by CPUs and 
I/O devices as a cache line of data. A cache line generally 
comprises several 72-bit data words. Thus, each DIMM 22 
accessed on a single memory bus provides a cache line of 
72-bit data words 24. 

[0026] Each of the 72 bits in each of the data words 24 is 
susceptible to soft errors. Different methods of error detec- 
tion may be used for different memory architectures. The 
present method and architecture incorporates a Redundant 
Array of Industry Standard DIMMs (RAID). As used herein,
RAID memory refers to a "4+1 scheme" in which a parity 
word is created using an XOR module such that any one of 
the four data words can be re-created using the parity word 
if an error is detected in one of the data words. Similarly, if 
an error is detected in the parity word, the parity word can 
be re-created using the four data words. By using the present 
RAID memory architecture, not only can multi-bit errors be 
easily detected and corrected, but it also provides a system 
in which the memory module alone or the memory module 
and associated memory controller can be removed and/or 
replaced while the system is running (i.e., the memory
modules and controllers are hot-pluggable). 

[0027] FIG. 3 illustrates one implementation of RAID 
memory. RAID memory stripes a cache line of data 25 such 
that each of the four 72-bit data words 26, 27, 28, and 29 is 
transmitted through a separate memory control device 30, 
31, 32, and 33. A fifth parity data word 34 is generated from 
the original cache line 25. Each parity word 34 is also 
transmitted through a separate memory control device 35. 
The generation of the parity data word 34 from the original 
cache line 25 of data words 26, 27, 28, and 29 can be 
illustrated by way of example. For simplicity, four-bit data 
words are illustrated. However, it should be understood that 
these principles are applicable to 72-bit data words, as in the
present system, or any other useful word lengths. Consider 
the following four data words: 

[0028] DATA WORD 1: 1 0 1 1 

[0029] DATA WORD 2: 0 0 1 0 

[0030] DATA WORD 3: 1 0 0 1 

[0031] DATA WORD 4: 0 1 1 1 

[0032] A parity word can be either even or odd. To create 
an even parity word, common bits are simply added 
together. If the sum of the common bits is odd, a "1" is 
placed in the common bit location of the parity word. 
Conversely, if the sum of the bits is even, a zero is placed in 
the common bit location of the parity word. In the present 
example, the bits may be summed as follows: 

[0033] DATA WORD 1: 1 0 1 1 

[0034] DATA WORD 2: 0 0 1 0 

[0035] DATA WORD 3: 1 0 0 1 

[0036] DATA WORD 4: 0 1 1 1 

[0037] 2 1 3 3

[0038] Parity Word: 0 1 1 1

[0039] When summed with the four exemplary data 
words, the parity word 0111 will provide an even number of 
active bits (or "1's") in every common bit location. This parity word
can be used to recreate any of the data words (1-4) if a soft 
error is detected in one of the data words as further explained 
with reference to FIG. 4. 
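
The even-parity rule described above can be checked with a short sketch. This is only an illustration using the four-bit example words from the text; the function name `even_parity_word` and the list-of-bits representation are assumptions made for the example, not part of the described system.

```python
def even_parity_word(words):
    """Sum each common bit location and derive the even-parity bit for it.

    `words` is a list of equal-length bit lists. A parity bit is 1 when the
    column sum is odd and 0 when it is even.
    """
    column_sums = [sum(column) for column in zip(*words)]
    parity = [total % 2 for total in column_sums]
    return column_sums, parity


# The four exemplary data words from the text.
DATA_WORDS = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

column_sums, parity = even_parity_word(DATA_WORDS)
print(column_sums)  # [2, 1, 3, 3]
print(parity)       # [0, 1, 1, 1]
```

Adding the parity word back into each column gives sums of 2, 2, 4, and 4, so every common bit location holds an even number of 1's.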

[0040] FIG. 4 illustrates the re-creation of a data word in 
which a soft error has been detected in a RAID memory 
system. As in FIG. 3, the original cache line 25 comprises 
four data words 26, 27, 28, and 29 and a parity word 34. 
Further, the memory control devices 30, 31, 32, 33, and 35
corresponding to each data word and parity word are illustrated.
In this example, a data error has been detected in the
data word 28. A new cache line 36 can be created using data 
words 26, 27, and 29 along with the parity word 34 using an 
exclusive-OR (XOR) module 37. By combining each data 
word 26, 27, 29 and the parity word 34 in the XOR module 
37, the data word 28 can be re-created. The new and correct 
cache line 36 thus comprises data words 26, 27, and 29
copied directly from the original cache line 25 and data word
28a (the re-created data word 28), which is produced
by the XOR module 37 using the error-free data
words (26, 27, 29) and the parity word 34. It should also be
clear that the same process may be used to re-create a parity
word 34 if an error is detected therein using the four
error-free data words.
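
As a rough software analogue of the XOR module 37, the re-creation step can be sketched as follows. The four-bit values are the exemplary words used earlier; the helper name `xor_words` and the use of Python integers for data words are assumptions for illustration only.

```python
from functools import reduce
from operator import xor


def xor_words(words):
    """XOR a sequence of integer-encoded words together."""
    return reduce(xor, words, 0)


# Exemplary four-bit data words (standing in for words 26, 27, 28, 29).
data_words = [0b1011, 0b0010, 0b1001, 0b0111]
parity_word = xor_words(data_words)

# Suppose the third word (standing in for data word 28) is reported bad.
# XOR the three good data words with the parity word to re-create it.
survivors = data_words[:2] + data_words[3:]
rebuilt_word = xor_words(survivors + [parity_word])

print(format(parity_word, "04b"))   # 0111
print(format(rebuilt_word, "04b"))  # 1001

# The same operation re-creates the parity word from the four data words.
rebuilt_parity = xor_words(data_words)
```

Because XOR is its own inverse, the same function serves both to generate the parity word and to rebuild any single missing word.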

[0041] Similarly, if the memory control device 32, which 
is associated with the data word 28, is removed during 
operation (i.e. hot-plugging) the data word 28 can similarly 
be re-created. Thus, any single memory control device can 
be removed while the system is running or any single 
memory control device can return a bad data word and the 
data can be re-created from the other four memory words 
using an XOR module. 

[0042] FIG. 5 illustrates one embodiment of a memory 
sub-system 40, which incorporates a redundant (4+1) 
scheme. The memory sub-system 40 comprises five memory 
cartridges 42a-e. Memory cartridge 42e, for example, may 
be used for parity storage. The memory cartridge 42a 
includes eight DIMMs 44 mounted thereon. Each DIMM 44 
includes nine memory devices, such as DRAMs 46 on each 
side of the DIMM substrate. (FIG. 5 illustrates only one side 
of the DIMM 44.) Further, the memory cartridge 42a has a 
memory control device 48a mounted thereon. It should be 
understood that each memory cartridge 42a-e includes a 
plurality of DIMMs 44 and a corresponding memory control 
device 48. The memory cartridges 42a-e may be mounted on
a memory system board 50 via connectors 52a-e to create
the memory sub-system 40. The memory sub-system 40 can 
be incorporated into a computer system via an edge con- 
nector 54 or by any suitable means of providing a data path 
from the computer system to the memory storage devices 46. 
It should be evident that each of the memory cartridges 
42a-e may be removed (hot-plugged) from the memory 
sub-system 40. By removing a memory cartridge such as 
memory cartridge 42a from the memory sub-system 40, the
computer system will transition from a redundant mode of 
operation (implementing the fifth memory cartridge) to a 
non-redundant state. When transitioning from a redundant to 
a non-redundant mode of operation during a hot-plug 
memory event, it may be advantageous to verify that no 
errors exist in the remaining memory cartridges 42b-e. Thus, 
immediately preceding the removal of the memory cartridge
42a, a verify procedure may be advantageously
implemented.

[0043] Further, a verify procedure may be advantageous in 
checking for memory errors in certain areas of memory 
which may sit idle for an extended period of time, allowing 
accumulation of errors or the growth of a single bit error to 
an uncorrectable multi-bit error. The verify procedure is 
implemented through a piece of logic which may reside in 
the memory sub-system 40. The verify logic can be pro- 
grammed to verify a specific region of memory such as the 
contents of a single memory cartridge 42a-e or to verify the 
validity of the entire memory. The verify procedure relies on 
the normal ECC and error logging mechanisms to validate 
the health of the memory sub-system 40. The verify routine 
may be exercised by an operator instruction, as part of a 
sequence of memory operations (such as a hot-plug event), 
or based on a predetermined schedule. Simply put, the verify 
logic will read a defined memory region. If errors are 
detected they may be recorded and corrected, as further 
discussed below with reference to FIG. 6. Verify may then 
be executed again to validate that the correction mechanism 
in fact corrected the errors that were reported. The verify 
logic may reside in each memory control device 48a-e or
on the memory system board 50. 

[0044] FIG. 6 is a block diagram illustrating one embodi- 
ment of the verify technique which incorporates the RAID 
memory architecture. As previously described, a computer 
system includes a memory sub-system 40 comprising 
memory cartridges 42a-e. As described with reference to 
FIG. 5, each memory cartridge 42a-e may include a 
memory control device 48a-e (shown in FIG. 5). Thus, to 
access the memory devices 46 (shown in FIG. 5) in memory 
cartridge 42a, a READ command is issued and data is
passed through the memory control device 48a, and so forth. 

[0045] Each memory control device 48a-e may comprise
ECC fault tolerance capability. As data is passed from the
memory sub-system 40 to the host controller 58 via a
memory network bus 60, each data word being produced by
a memory cartridge 42a-e is checked for single-bit memory
errors in each respective memory control device 48a-e
(residing on each respective memory cartridge 42a-e) by
typical ECC methods. If no errors are detected, the data is
simply passed to the host controller 58 and eventually to a
requesting device via an OUTPUT 68. If a single-bit error is
detected by a memory control device 48a-e, the data is
corrected by the memory control device 48a-e. When the
corrected data is sent to the host controller 58 via the
memory network bus 60, error detection and correction
devices 62a-e, which reside in the host controller 58 and may
be identical to the ECC devices in the memory control
devices 48a-e, will not detect any erroneous data words
since the single-bit errors have been corrected by the
memory control devices 48a-e in the memory sub-system
40. Therefore, if an error is detected and corrected by the
memory control devices 48a-e, a message is sent from the
memory control devices 48a-e to the host controller 58
indicating that a memory error has been detected and
corrected and that the corresponding memory cartridge
42a-e should be overwritten with corrected data, as discussed
in more detail below.

[0046] In an alternate embodiment, the error detection 
capabilities in the memory control devices 48a-e may be 
turned off or eliminated. Because the host controller 58 also 
includes error detection and correction devices 62a-e, any
single bit errors can still be corrected using the standard 
ECC methods available in the host controller 58. Further, it 
is possible that errors may be injected while the data is on 
the memory network bus 60. In this instance, even if the 
error detection capabilities are turned on in the memory 
control devices 48a-e, the memory control devices 48a-e
will not detect an error since the error is injected after the 
data has passed from the memory sub-system 40. Advanta- 
geously, since the host controller 58 includes similar or even 
identical error detection and correction devices 62a-e, the 
errors can be detected and corrected in the host controller 58. 

[0047] If a multi-bit error is detected in one of the memory
control devices 48a-e, that memory control device,
with standard ECC capabilities, can detect the error but will
not be able to correct it. Therefore, the erroneous
data is passed to the error detection and correction devices
62a-e. Like the memory control devices 48a-e, the error
detection and correction devices 62a-e, which also have
typical ECC detection, can only detect but not correct the
multi-bit errors. The erroneous data words may be passed to
the RAID memory engine 64 via some READ/WRITE
control logic 66, for correction.

[0048] In a typical memory READ operation, the host 
controller 58 will issue a READ command on the memory 
network bus 60, the READ command originating from an 
external device such as a disk drive. The memory control 
devices 48a-e receive the request and retrieve the data from
the corresponding memory cartridge 42a-e. The data is then
passed from the memory sub-system 40 to the host controller
58. As described above, single-bit errors may be
corrected in either the memory control devices 48a-e or the
detection and correction devices 62a-e. The RAID memory
engine 64 will correct the multi-bit errors, as described 
above. The corrected data will be delivered from the host 
controller 58 to the requesting controller or I/O device via an 
OUTPUT 68. 
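
The READ path just described can be summarized in a simplified software model. This sketch does not implement an ECC code; it assumes each word arrives with a status already reported by an ECC check, and it models only the routing: clean words pass through, single-bit errors take the ECC-corrected value, and a multi-bit error is rebuilt by XOR from the other four words. All names here (`Ecc`, `WordRead`, `read_stripe`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from functools import reduce
from operator import xor
from typing import Optional


class Ecc(Enum):
    CLEAN = "clean"
    SINGLE_BIT = "single-bit"   # detected and corrected by ECC
    MULTI_BIT = "multi-bit"     # detected by ECC but not correctable by it


@dataclass
class WordRead:
    raw: int                         # word as read from one cartridge
    status: Ecc                      # result reported by the modeled ECC check
    ecc_fixed: Optional[int] = None  # ECC-corrected value (single-bit case only)


def read_stripe(reads):
    """Return (corrected_words, indices_to_scrub) for four data words plus parity."""
    uncorrectable = [i for i, r in enumerate(reads) if r.status is Ecc.MULTI_BIT]
    if len(uncorrectable) > 1:
        raise RuntimeError("XOR parity can rebuild only one word per stripe")

    words, scrub = [], []
    for i, r in enumerate(reads):
        if r.status is Ecc.CLEAN:
            words.append(r.raw)
        elif r.status is Ecc.SINGLE_BIT:
            words.append(r.ecc_fixed)
            scrub.append(i)
        else:
            words.append(0)  # placeholder, rebuilt below
            scrub.append(i)

    for i in uncorrectable:
        others = (w for j, w in enumerate(words) if j != i)
        words[i] = reduce(xor, others, 0)
    return words, scrub


stripe = [
    WordRead(0b1010, Ecc.SINGLE_BIT, ecc_fixed=0b1011),
    WordRead(0b0010, Ecc.CLEAN),
    WordRead(0b0110, Ecc.MULTI_BIT),
    WordRead(0b0111, Ecc.CLEAN),
    WordRead(0b0111, Ecc.CLEAN),  # parity word
]
words, scrub = read_stripe(stripe)
print([format(w, "04b") for w in words])  # ['1011', '0010', '1001', '0111', '0111']
print(scrub)                              # [0, 2]
```

The returned index list marks the words whose stored copies are still corrupted in memory even though corrected data was delivered to the requester.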

[0049] It should be evident from the discussion above, that 
performing error detection and correction on data residing in 
the memory sub-system 40 by relying on READ operations 
sent from peripheral devices will only result in detection of 
errors on those devices from which data is read. By relying 
on the READ command from a peripheral device, certain 
areas of memory may sit idle for extended periods thereby 
allowing data errors to accumulate undetected. To address 
this issue, an additional piece of logic may reside in the 
memory sub-system 40. The verify logic 70 initiates a 
routine based on an operator instruction, a pre-determined 
periodic instruction, or some sequence of events such as a 
hot-plug event, for example. The verify logic 70 initiates a 
check of the specified memory location in the memory 
sub-system 40 without depending on normal READ
accesses by external devices. 

[0050] The verify logic 70 initiates a verify procedure 
through an arbiter 72 in the host controller 58. The arbiter 72 
is generally responsible for prioritizing accesses to the 
memory sub-system 40. A queue comprises a plurality of 
requests such as memory READ, memory WRITE, memory 
verify, and memory scrubs (discussed further below), for 
example. The arbiter 72 prioritizes the requests and other- 
wise manages the queue. The verify logic 70 essentially 
initiates its own internal READ command to check specified 
regions of the memory sub-system 40. Once the verify logic 
70 initiates a request to the arbiter 72, the verify procedure 
is scheduled in the queue. The request will pass through the 
READ/WRITE control logic 66 and to the memory sub- 
system 40. The specified memory locations in the memory 
sub-system 40 will be read and any errors will be detected 
and/or corrected by the means described above with refer- 
ence to the READ command issued by a peripheral device. 
The verify procedure implemented by the verify logic 70 can 
be initiated in a variety of ways. For instance, a user may be 
able to check specified memory locations by pulling up a 
window on an operating system. The window may allow a 
user to specify what locations in memory the user would like 
checked. By providing a user with the ability to check 
specified memory locations, the verify procedure provides 
user confidence in the validity of data stored in the memory 
sub-system 40. 

[0051] Alternatively, the verify procedure may be a periodically
scheduled event. In this instance, the verify logic 70
may include a timer and a buffer for storing a list of each
address location in the memory sub-system 40. At programmed
or specified time intervals, the verify logic 70 may
initiate READ commands to the arbiter 72 to verify the data
stored in the corresponding address locations in the memory
sub-system 40. The verify logic 70 may initiate READ
commands through successive addresses in the memory
sub-system 40 such that every memory address is eventually
checked. The verify logic 70 thus may ensure that all address
locations in the memory sub-system 40 or a specified set of
address locations are periodically checked for validity. Furthermore,
the READ command issued by the verify logic 70
may be scheduled as a low priority thread in the arbiter 72
to minimize system impact. In this way, the verify procedure
may only be run during periods of low system activity (e.g.,
when the queue in the arbiter 72 does not include READ/WRITE
requests from external devices).
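
One way to picture low-priority scheduling of verify READs is a priority queue in which external READ/WRITE requests always drain first. The sketch below is a software analogy only: the `Arbiter` class, its method names, and the two-level priority table are assumptions for illustration and are not taken from the described hardware.

```python
import heapq
from itertools import count

# Assumed ordering: lower number is served first. External requests outrank verify.
PRIORITY = {"READ": 0, "WRITE": 0, "VERIFY": 1}


class Arbiter:
    """Minimal request queue that serves verify READs only when nothing else waits."""

    def __init__(self):
        self._heap = []
        self._sequence = count()  # preserves arrival order within one priority

    def submit(self, kind, address):
        entry = (PRIORITY[kind], next(self._sequence), kind, address)
        heapq.heappush(self._heap, entry)

    def next_request(self):
        """Return the next (kind, address) pair, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, kind, address = heapq.heappop(self._heap)
        return kind, address


arbiter = Arbiter()
arbiter.submit("VERIFY", 0x00)
arbiter.submit("READ", 0x10)
arbiter.submit("VERIFY", 0x08)
arbiter.submit("WRITE", 0x20)

served = []
while (request := arbiter.next_request()) is not None:
    served.append(request)
print(served)  # [('READ', 16), ('WRITE', 32), ('VERIFY', 0), ('VERIFY', 8)]
```

The two verify requests arrive first and third but are served last, which mirrors running the verify procedure only when no external READ/WRITE requests are queued.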

[0052] Yet another implementation of the verify logic 70 
includes a verify operation to validate a memory cartridge 
when the memory sub-system 40 is switching from a non- 
redundant mode of operation to a redundant mode of opera- 
tion (i.e. during a hot-plug event). For example, referring 
back to FIG. 5, the memory cartridges 42b-e are currently 
connected to the memory system board 50. Assuming that 
the memory system board 50 is operably coupled to a host 
system including a host controller 58 (as illustrated in FIG. 
6), the memory sub-system 40 is operating in a non- 
redundant mode since there is no additional memory car- 
tridge 42a to be used for parity. If a memory cartridge 42a 
is installed into the memory sub-system 40, it may be
advantageous to verify the memory devices 46 residing on 
the memory cartridge 42a. The verify logic 70 can be 
implemented to check each address location on the memory 
devices 46 on the memory cartridge 42a before the system 
transitions to a redundant mode of operation. 

[0053] First, the verify logic 70 initializes the memory
cartridge 42a by writing zeros to each address location in the
memory cartridge 42a. The verify logic 70 schedules the
initialization WRITEs through the arbiter 72. Next, the
verify logic 70 rebuilds the memory cartridge 42a by using
the techniques described in FIGS. 3 and 4 to recreate the
parity data that should be stored in the memory cartridge
42a. As previously described, each cache line of data from
the memory cartridges 42b-e is used to recreate the parity
cache line by using the XOR module in the RAID memory
engine 64. Each recreated cache line is then written to the
corresponding location in the memory cartridge 42a. Finally,
once the data in the memory cartridge 42a is rebuilt, the
verify logic 70 may initiate a READ to ensure that the data
that should have been written to the memory cartridge 42a
was in fact stored there. This procedure can be performed by
again using the data stored in the memory cartridges 42b-e
to recreate the data that should be stored in the
memory cartridge 42a, and then by comparing those values
to the values that were stored in the memory cartridge 42a
during the rebuild procedure. If the data does not match, an
error message may be provided to a user indicating that a
DIMM on the memory cartridge 42a may be bad. If there are
no errors found in the new memory cartridge 42a, the system
may switch from a non-redundant mode of operation to a
redundant mode of operation.
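
The initialize, rebuild, and verify sequence above can be sketched as three passes over the new cartridge. This is a simplified model in which each cartridge is a list of integer words indexed by address; the function names and that representation are assumptions for illustration.

```python
from functools import reduce
from operator import xor


def expected_parity(data_cartridges, address):
    """XOR the words stored at one address across the data cartridges."""
    return reduce(xor, (cartridge[address] for cartridge in data_cartridges), 0)


def rebuild_and_verify(data_cartridges, new_cartridge):
    """Initialize, rebuild, and verify a newly installed parity cartridge.

    Returns the addresses whose stored value does not match the recomputed
    parity. An empty list means the cartridge verified cleanly.
    """
    addresses = range(len(new_cartridge))

    # Step 1: initialize every location to zero.
    for address in addresses:
        new_cartridge[address] = 0

    # Step 2: rebuild the parity data from the other four cartridges.
    for address in addresses:
        new_cartridge[address] = expected_parity(data_cartridges, address)

    # Step 3: read back and compare against freshly recomputed parity.
    return [
        address
        for address in addresses
        if new_cartridge[address] != expected_parity(data_cartridges, address)
    ]


# Four data cartridges (standing in for 42b-e), two addresses each.
data = [
    [0b1011, 0b0001],
    [0b0010, 0b0110],
    [0b1001, 0b1111],
    [0b0111, 0b1000],
]
new_cartridge = [0b1111, 0b1111]  # arbitrary stale contents before the rebuild
mismatches = rebuild_and_verify(data, new_cartridge)
print(mismatches)  # []
```

A non-empty result would correspond to the case in which an error message is provided to the user and the system stays in the non-redundant mode.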

[0054] To this point, error detection via peripheral READ
commands and READ commands implemented by the
verify logic 70 has been discussed. The memory control
devices 48a-e, the error detection and correction devices
62a-e and the RAID memory engine 64 can be used to
correct the data before it is written to the output 68.
However, at this point the data residing in the memory
sub-system 40 may still be corrupted. To rectify this problem,
the data in the memory sub-system 40 may be overwritten
or "scrubbed." For every data word in which a single-bit
error is detected and flagged by the memory control
devices 48a-e or the error detection and correction devices
62a-e, a request is sent to the scrubbing control logic 74
indicating that the corresponding memory location should be
scrubbed during a subsequent WRITE operation initiated by
the scrubbing control logic 74. Similarly, if a multi-bit error
is detected by the error detection and correction devices
62a-e, the data is corrected through the RAID memory
engine 64, and the scrubbing control logic 74 is notified by
the corresponding error detection and correction device
62a-e that the corresponding memory location in the
memory sub-system 40 should be scrubbed. If a single-bit
error is detected in one of the memory control devices 48a-e,
or a multi-bit error is detected in one of the error detection
and correction devices 62a-e, a message is sent to the
scrubbing control logic 74 indicating that an erroneous data
word has been detected. At this time, the corrected data word
and corresponding address location are sent from the RAID
memory engine 64 to a buffer 76 which is associated with
the scrubbing process. The buffer 76 is used to store the
corrected data and corresponding address location temporarily
until such time that the scrubbing process can be
implemented. Once the scrubbing control logic 74 receives
an indicator that a corrupted data word has been detected and
should be corrected in the memory sub-system 40, a request
is sent to the arbiter 72 which schedules and facilitates all
accesses to the memory sub-system 40. To ensure proper
timing and data control, each time a data word is rewritten
back to the memory sub-system 40, an entire cache line may
be rewritten into each of the corresponding memory cartridges
42a-e in the sub-system 40 rather than just rewriting
the erroneous data word. The scrubbing logic can be used to
rewrite the locations in the memory sub-system 40 when
errors are found during a typical READ operation or a verify
procedure initiated by the verify logic 70.

[0055] Further, the host controller 58 may include a con- 
tent addressable memory (CAM) controller 78. The CAM 
controller 78 provides a means of ensuring that memory
WRITEs are only performed when necessary. Because many 
READ and WRITE requests are active at any given time on 
the memory network bus 60 and because a scrubbing 
operation to correct corrupted data may be scheduled after a 
WRITE to the same memory location, the CAM controller 
78 will compare all outstanding WRITE requests to subse- 
quent memory scrub requests which are currently scheduled 
in the queue. It is possible that a corrupted memory location 
in the memory sub-system 40 which has a data scrub request 
waiting in the queue may be overwritten with new data prior 
to the scrubbing operation to correct the old data previously 
present in the memory sub-system 40. In this case, the CAM 
controller 78 will recognize that new data has been written 
to the address location in the memory sub-system 40 by 
implementing a simple compare function between the 
addresses and will cancel the scheduled scrubbing operation. 
The CAM controller 78 will ensure that the old corrected
data does not overwrite new data which has been stored in
the corresponding address location in the memory sub- 
system 40. 
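
The address-compare behavior attributed to the CAM controller 78 can be illustrated with a small software model. A dictionary keyed by address stands in for the content addressable lookup; the `ScrubQueue` class and its method names are hypothetical, and memory is modeled as a plain dictionary.

```python
class ScrubQueue:
    """Pending scrub requests keyed by address, cancelled by newer WRITEs."""

    def __init__(self):
        self._pending = {}  # address -> corrected cache line awaiting write-back

    def schedule_scrub(self, address, corrected_line):
        self._pending[address] = corrected_line

    def note_write(self, address):
        """Record that new data was written to `address`.

        Any scrub still pending for that address is now stale, so it is
        cancelled. Returns True if a scrub was cancelled.
        """
        return self._pending.pop(address, None) is not None

    def drain(self, memory):
        """Write back every scrub that is still pending."""
        for address, corrected_line in self._pending.items():
            memory[address] = corrected_line
        self._pending.clear()


memory = {0x100: "corrupted A", 0x200: "corrupted B"}
queue = ScrubQueue()
queue.schedule_scrub(0x100, "corrected A")
queue.schedule_scrub(0x200, "corrected B")

# A normal WRITE replaces the contents of 0x100 before the scrub runs.
memory[0x100] = "new data"
cancelled = queue.note_write(0x100)

queue.drain(memory)
print(cancelled)      # True
print(memory[0x100])  # new data
print(memory[0x200])  # corrected B
```

Without the cancellation step, draining the queue would have replaced the new data at 0x100 with the older corrected line.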

[0056] It should be noted that the error detection and
scrubbing technique described herein may not distinguish 
between soft and hard errors. While corrected data may still 
be distributed through the output of the host controller 58, if 
the errors are hard errors, the scrubbing operation to correct 
the erroneous data words in the memory sub-system 40 will 
be unsuccessful. To solve this problem, software in the host 
controller 58 may track the number of data errors associated 
with a particular data word or memory location. After some 
pre-determined number of repeated errors are detected in the 
same data word or memory location, the host controller 58 
may send an error message to a user or illuminate an LED 
corresponding to the device in which the repeat error is 
detected. 
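
The repeat-error tracking described above amounts to a per-location counter compared against a threshold. The following is a minimal sketch under that reading; the class name, the string labels used for locations, and the threshold value of 3 are all assumptions chosen for the example.

```python
from collections import Counter


class RepeatErrorTracker:
    """Count errors per memory location and flag locations that keep failing."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.counts = Counter()

    def record_error(self, location):
        """Record one error; return True once the location reaches the threshold."""
        self.counts[location] += 1
        return self.counts[location] >= self.threshold


tracker = RepeatErrorTracker(threshold=3)  # assumed value for illustration
results = [tracker.record_error("DIMM 3") for _ in range(3)]
other = tracker.record_error("DIMM 5")
print(results)  # [False, False, True]
print(other)    # False
```

A True result is the point at which the host controller might send an error message or illuminate the LED for the affected device.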

[0057] While the invention may be susceptible to various 
modifications and alternative forms, specific embodiments 
have been shown by way of example in the drawings and 
will be described in detail herein. However, it should be 
understood that the invention is not intended to be limited to 
the particular forms disclosed. Rather, the invention is to 
cover all modifications, equivalents and alternatives falling 
within the spirit and scope of the invention as defined by the 
following appended claims. 

What is claimed is: 

1. A method of tracking errors in a memory system 
comprising the acts of: 

detecting an error in a semiconductor memory segment; 

determining an error type, the error type being one of a 
soft error and a hard error; 

counting the number of hard errors detected in the 
memory segment; and 

indicating that a threshold number of hard errors has been 
reached. 

2. The method of tracking errors, as set forth in claim 1, 
wherein the act of detecting an error comprises the act of 
detecting an error using an ECC algorithm. 

3. The method of tracking errors, as set forth in claim 1, 
wherein the act of detecting an error comprises the act of 
detecting an error in a dual inline memory module (DIMM). 

4. The method of tracking errors, as set forth in claim 1, 
wherein the act of detecting an error comprises the act of 
detecting an error during execution of a READ request. 

5. The method of tracking errors, as set forth in claim 1, 
wherein the act of detecting an error type comprises the acts 
of: 

writing corrected data to a memory segment address 
corresponding to the error; 

reading the corrected data from the memory segment 
address corresponding to the error; and 

performing error detection on the corrected data read from 
the memory segment address corresponding to the 
error, wherein if a second error is detected, defining the 
error as a hard error. 

6. The method of tracking errors, as set forth in claim 1, 
wherein the act of counting comprises the act of increment- 
ing a counter each time a hard error is detected. 

7. The method of tracking errors, as set forth in claim 1, 
wherein the act of indicating comprises the act of illumi- 
nating a light emitting diode (LED). 



8. The method of tracking errors, as set forth in claim 1, 
wherein the act of indicating comprises the act of indicating 
that the threshold number of hard errors has been reached, 
the threshold number corresponding to an indication that the 
memory segment having the hard errors should be replaced.

9. The method of tracking errors, as set forth in claim 1, 
comprising the act of selecting the threshold, wherein the 
threshold number corresponds to a user-selectable maximum 
number of hard errors corresponding to an indication that the 
memory segment having the hard errors should be replaced. 

10. An error detection system comprising: 

a plurality of semiconductor memory segments; 

a plurality of memory controllers, wherein each of the 
memory controllers is operably coupled to a corre- 
sponding one of the plurality of memory segments and 
configured to initiate requests to the respective memory 
segment; 

error detection logic configured to detect errors during 
execution of a first READ request, wherein the errors 
comprise one of a soft error and a hard error in the 
plurality of memory segments; and 

a counting device configured to count only when a hard 
error is detected. 

11. The error detection system, as set forth in claim 10, 
wherein each of the plurality of memory segments com- 
prises a dual inline memory module (DIMM). 

12. The error detection system, as set forth in claim 11, 
comprising a light emitting diode (LED) corresponding to 
each of the dual inline memory modules (DIMMs), wherein 
each of the LEDs is configured to illuminate in response to 
the counting device reaching a threshold number N of hard 
errors for the respective DIMM. 

13. The error detection system, as set forth in claim 12, 
wherein the threshold number N is user-programmable. 

14. The error detection system, as set forth in claim 10, 
wherein each of the plurality of memory controllers is 
configured to initiate a WRITE request in response to an 
error being detected, the WRITE request being initiated to 
write corrected data to an address corresponding to the 
detected error. 

15. The error detection system, as set forth in claim 14, 
wherein each of the plurality of memory controllers is 
configured to initiate a second READ request after the 
WRITE request, the READ request being initiated to read 
the corrected data. 

16. The error detection system, as set forth in claim 15, 
wherein the error detection logic is configured to detect 
errors during execution of the second READ request. 

17. The error detection system, as set forth in claim 16, 
wherein the counting device is configured to count when an 
error is detected in the data corresponding to the second 
READ request. 

18. The error detection system, as set forth in claim 10, 
wherein each of the plurality of memory controllers com- 
prises the error detection logic. 

19. A method of manufacturing a memory system, com- 
prising the acts of: 

providing a device to detect hard errors in the memory 
system; 

providing a device to count the number of hard errors 
detected in the memory system; and 
providing an indication device to indicate that a threshold 
number of hard errors have been counted. 

20. The method of manufacturing, as set forth in claim 19, 
comprising the acts of: 

providing a plurality of memory segments; and 

providing a plurality of memory controllers, each of the 
plurality of memory controllers corresponding to one of 
the plurality of memory segments, and wherein each of 
the plurality of memory controllers is configured to 
provide access to the memory segments. 

21. The method of manufacturing, as set forth in claim 19, 
wherein the act of providing a device to detect hard errors 
comprises the act of providing a device comprising an ECC 
algorithm. 

22. The method of manufacturing, as set forth in claim 19, 
wherein the act of providing a device to detect hard errors 
comprises the act of providing a memory controller 
comprising an ECC algorithm. 

23. The method of manufacturing, as set forth in claim 20, 
wherein the act of providing a device to count the number of 
hard errors detected comprises the act of providing a counter 
configured to increment by one each time a hard error is 
detected. 

24. The method of manufacturing, as set forth in claim 20, 
comprising providing a configuration register configured to 
store the threshold number of errors, the threshold number 
of errors corresponding to a maximum number of errors that 
may be detected without indicating a memory segment error. 

25. The method of manufacturing, as set forth in claim 20, 
wherein the act of providing an indication device to indicate 
that a threshold number of hard errors comprises the act of 
providing a light emitting diode (LED) corresponding to 
each of the plurality of memory segments and configured to 
illuminate when the threshold number of errors has been 
detected. 
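
By way of illustration only, the error-tracking behavior recited in claims 10 through 17 and the threshold indication of claims 12, 24 and 25 may be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed hardware: the names MemorySegment, MemoryController, shadow, stuck and led_on are hypothetical, and a stored shadow copy stands in for the ECC algorithm that would reconstruct the corrected data. An error seen on the first READ is written back and re-read; the counter increments only if the second READ still fails, which is treated here as a hard error.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemorySegment:
    """Hypothetical model of one memory segment (for example, a DIMM)."""
    cells: Dict[int, int] = field(default_factory=dict)
    # Assumption: a hard fault is modeled as a cell stuck at a fixed value.
    stuck: Dict[int, int] = field(default_factory=dict)

    def read(self, addr: int) -> int:
        # A stuck cell always returns its stuck value, whatever was written.
        return self.stuck.get(addr, self.cells.get(addr, 0))

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value


class MemoryController:
    """Hypothetical controller paired with one segment."""

    def __init__(self, segment: MemorySegment, threshold: int) -> None:
        self.segment = segment
        self.threshold = threshold      # user-programmable threshold N
        self.hard_errors = 0            # counts hard errors only
        self.led_on = False             # stand-in for the per-segment LED
        self.shadow: Dict[int, int] = {}  # stand-in for ECC-corrected data

    def store(self, addr: int, value: int) -> None:
        self.shadow[addr] = value
        self.segment.write(addr, value)

    def load(self, addr: int) -> int:
        corrected = self.shadow[addr]
        if self.segment.read(addr) != corrected:      # error on first READ
            self.segment.write(addr, corrected)       # WRITE corrected data
            if self.segment.read(addr) != corrected:  # second READ still bad
                self.hard_errors += 1                 # hard error: count it
                if self.hard_errors >= self.threshold:
                    self.led_on = True                # indicate replacement
        return corrected


# Usage: one soft error (cleared by the write-back) and one stuck cell.
seg = MemorySegment()
ctl = MemoryController(seg, threshold=2)
ctl.store(0, 5)
seg.cells[0] = 7           # inject a soft error at address 0
value_soft = ctl.load(0)   # corrected and rewritten; counter unchanged
seg.stuck[1] = 0           # inject a hard fault at address 1
ctl.store(1, 9)
ctl.load(1)                # first hard error counted
ctl.load(1)                # second hard error reaches the threshold
```

In this sketch a soft error never advances the counter, because the rewritten data reads back correctly, while a stuck cell is counted on every access until the threshold is reached and the indicator is set.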

* * * * * 